
Spark Final Assignment: Elementary RDD Programming Practice

1. Description of requirements

This experiment requires: operating system: Linux Ubuntu 14.04; processor: at least two processors with one core each; memory: at least 4 GB; hard disk: at least 20 GB of free space; Hadoop 2.7.1 or above; JDK 1.8 or above; Spark 2.4.0 or above; Python 3.6 or above.

1. Analyze the performance of a university's computer science department based on the given data

(1) The total number of students in the department;

(2) How many courses are offered in the department;

(3) What is the average score of the student Tom;

(4) Find the number of courses taken by each student;

(5) How many students are enrolled in the DataBase course in the department;

(6) What is the grade point average for each course;

(7) Use the accumulator to calculate how many people have taken the course DataBase.

2. Write a standalone application for data de-duplication

For two input files A and B, write a Spark standalone application that merges the two files, eliminates the duplicates, and produces a new file C.

3. Write a standalone application to solve the averaging problem

Each input file contains the grades of the students of the class in one subject, and each line consists of two fields, the first being the student's name and the second being the student's grade. Write a Spark standalone application that computes the average grade of each student and writes the result to a new file.

2. Introduction to the environment

Environmental Preparation:

  • Hadoop download and installation
  1. /share/init?surl=mUR3M2U_lbdBzyV_p85eSA

(Extract code: 99bg) After entering this Baidu cloud disk link, find the Hadoop installation file hadoop-2.7.

2. After downloading, some configuration work is still required before Hadoop can be installed.

(1) First, create the hadoop user: sudo useradd -m hadoop -s /bin/bash

(2) Set the password for the hadoop user: sudo passwd hadoop

(3) Grant administrator privileges to the hadoop user: sudo adduser hadoop sudo

(4) After logging in as the hadoop user, update apt: sudo apt-get update

(5) Install vim: sudo apt-get install vim

(6) Install the Java environment:

sudo apt-get install openjdk-7-jre openjdk-7-jdk

(7) After installing OpenJDK, you need to find the appropriate installation path, which is used to configure the JAVA_HOME environment variable.

dpkg -L openjdk-7-jdk | grep '/bin/javac'

(8) Next, you need to configure the JAVA_HOME environment variable. For convenience, we set it in ~/.bashrc: sudo vim ~/.bashrc

(9) At the top of the file, add a single line of the following form (note that there must be no space before or after the = sign), replace "JDK installation path" with the path obtained from the command above, and save the file:
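For example, assuming the dpkg command above reported the JDK under /usr/lib/jvm/java-7-openjdk-amd64 (the exact path may differ on your machine), the added line would look like this:

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64   # replace with your own JDK installation path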

(10) Refresh the environment variables: source ~/.bashrc

(11) Install Hadoop; we choose to install it into /usr/local/:

sudo tar -zxf ~/Downloads/hadoop-2.6.0.tar.gz -C /usr/local    # Extract to /usr/local

cd /usr/local/

sudo mv ./hadoop-2.6.0/ ./hadoop    # Change the folder name to hadoop

sudo chown -R hadoop ./hadoop       # Modify file permissions

(12) Hadoop is ready to use once it has been unpacked. Enter the following command to check whether Hadoop is available; if it succeeds, the Hadoop version information will be displayed:

cd /usr/local/hadoop

./bin/hadoop version

  • Spark download and installation

(1) Spark official download address: /

(2) Here Spark is installed in Local mode (single-machine mode). We choose Spark version 1.6.2 and assume that we are currently logged into the Linux system as the user hadoop.

sudo tar -zxf ~/Downloads/spark-1.6.2-bin-without-hadoop.tgz -C /usr/local/    # Extract the archive

cd /usr/local

sudo mv ./spark-1.6.2-bin-without-hadoop/ ./spark    # Rename the folder to spark

sudo chown -R hadoop:hadoop ./spark    # 'hadoop' here is your username; this grants it ownership of the folder

(3) After installation, you also need to modify the Spark configuration file

cd /usr/local/spark

cp ./conf/spark-env.sh.template ./conf/spark-env.sh

(4) Edit the file (vim ./conf/spark-env.sh) and add the following configuration on its first line:

export SPARK_DIST_CLASSPATH=$(/usr/local/hadoop/bin/hadoop classpath)

With this configuration, Spark can write data to and read data from the Hadoop Distributed File System (HDFS); without it, Spark can only read and write local data. Once the configuration is complete, Spark can be used directly; unlike Hadoop, there is no startup command to run first.
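As a quick sanity check (a minimal sketch, assuming HDFS is already running and a data file has been uploaded; the HDFS address and file name below are only placeholders), you can start pyspark and read from both storage layers:

local_rdd = sc.textFile("file:///usr/local/spark/README.md")            # read a local file
hdfs_rdd = sc.textFile("hdfs://localhost:9000/user/hadoop/data.txt")    # read a file from HDFS (hypothetical path)
print(local_rdd.count(), hdfs_rdd.count())                              # both counts print only if the configuration works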

3. Description of data sources

The data comes from the final assignment materials, six files in total. Because the school's system does not support two-way copy and paste between the host and the virtual machine (if yours does, this step can be skipped), FileZilla is used on the Windows host to transfer the files to the VirtualBox virtual machine. The specific steps are as follows:

  1. Set the VM network to bridged mode, open a terminal in the VM and type ifconfig, then check the VM's IP address and copy it; if no address appears, refresh the network and retry.
  2. Open FileZilla, go to File, Site Manager, New Site, enter the IP address you just copied as the host, enter hadoop as the username and the password of the hadoop user on your virtual machine as the password, then click Connect.

If the connection is successful, you can transfer files.

4. Data upload and viewing the upload results

 

5. Description of the data processing process

pyspark interactive programming

1. Total number of students in the department.

  1. lines = sc.textFile("file:///usr/local/spark/sparksqldata/data.txt")  # load the data.txt file
  2. res = lines.map(lambda x: x.split(",")).map(lambda x: x[0])  # take the 1st column (student name) of each row
  3. sum = res.distinct()  # distinct() removes duplicates
  4. sum.count()  # count the elements; the result is 265

2. How many courses are offered in the department;

  1. lines = sc.textFile("file:///usr/local/spark/sparksqldata/data.txt")  # load the data.txt file
  2. res = lines.map(lambda x: x.split(",")).map(lambda x: x[1])  # take the 2nd column (course name) of each row
  3. dis_res = res.distinct()  # distinct() removes duplicates
  4. dis_res.count()  # count the elements; the result is 8

3. What is the average score of the student Tom;

  1. lines = sc.textFile("file:///usr/local/spark/sparksqldata/data.txt")  # load the data.txt file
  2. res = lines.map(lambda x: x.split(",")).filter(lambda x: x[0] == "Tom")  # filter out Tom's grade records
  3. res.foreach(print)  # print each record in turn

  1. score = res.map(lambda x: int(x[2]))  # extract each of Tom's grades and convert it to int
  2. num = res.count()  # number of courses Tom has taken
  3. sum_score = score.reduce(lambda x, y: x + y)  # Tom's total score
  4. avg = sum_score / num  # total score / number of courses = average score
  5. print(avg)  # print the average score

4. Find the number of courses taken by each student;

  1. lines = sc.textFile("file:///usr/local/spark/sparksqldata/data.txt")  # load the data.txt file
  2. res = lines.map(lambda x: x.split(",")).map(lambda x: (x[0], 1))  # each course a student takes maps to (name, 1); a student with n courses produces n such pairs
  3. each_res = res.reduceByKey(lambda x, y: x + y)  # sum by student name to get the number of courses each student takes
  4. each_res.foreach(print)  # print each result in turn

5. How many students are enrolled in the DataBase course in the department;

  1. lines = sc.textFile("file:///usr/local/spark/sparksqldata/data.txt")  # load the data.txt file
  2. res = lines.map(lambda x: x.split(",")).filter(lambda x: x[1] == "DataBase")  # keep only the DataBase records
  3. res.count()  # use count() to get the number of students

6. What is the grade point average for each course;

  1. lines = sc.textFile("file:///usr/local/spark/sparksqldata/data.txt")  # load the data.txt file
  2. res = lines.map(lambda x: x.split(",")).map(lambda x: (x[1], (int(x[2]), 1)))  # append a 1 after each score, representing one student who chose the course
  3. temp = res.reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))  # aggregate the total points and the number of students per course, e.g. ('ComputerNetwork', (7370, 142))
  4. avg = temp.map(lambda x: (x[0], round(x[1][0] / x[1][1], 2)))  # total points / number of students = average score, kept to two decimal places with round()
  5. avg.foreach(print)  # print each result in turn

 

7. Use the accumulator to calculate how many people have taken the DataBase course.

  1. lines = sc.textFile("file:///usr/local/spark/sparksqldata/data.txt")  # load the data.txt file
  2. res = lines.map(lambda x: x.split(",")).filter(lambda x: x[1] == "DataBase")  # filter out the records of students who took DataBase
  3. accum = sc.accumulator(0)  # define an accumulator accum starting from 0
  4. res.foreach(lambda x: accum.add(1))  # iterate over res, adding 1 to the accumulator for each record
  5. accum.value  # the final value of the accumulator is 1764

 

Write a standalone application for data de-duplication

1. Import the SparkContext package.

2. Initialize the SparkContext.

3. Load the two files A and B.

4. Use union to merge the contents of the two files.

5. Use distinct to remove the duplicates.

6. Use sortBy to sort the result.

7. Write the result to an output file; repartition(1) merges the result into a single file, otherwise it would be written to two files.

  1. from pyspark import SparkContext
  2. sc = SparkContext('local', 'sparksqldata')
  3. lines1 = sc.textFile("file:///usr/local/spark/sparksqldata/A.txt")  # load file A
  4. lines2 = sc.textFile("file:///usr/local/spark/sparksqldata/B.txt")  # load file B
  5. lines = lines1.union(lines2)  # merge the two files
  6. dis_lines = lines.distinct()  # remove duplicates
  7. res = dis_lines.sortBy(lambda x: x)  # sort the result
  8. res.repartition(1).saveAsTextFile("file:///usr/local/spark/sparksqldata/result")  # write to a single output file

The result contains 500 rows in total; only the first page is shown in the screenshot.

Write a standalone application to solve the averaging problem

  1. from pyspark import SparkContext
  2. sc = SparkContext("local", "sparksqldata")
  3. lines1 = sc.textFile("file:///usr/local/spark/sparksqldata/")  # load the grade file of the first subject
  4. lines2 = sc.textFile("file:///usr/local/spark/sparksqldata/")  # load the grade file of the second subject
  5. lines3 = sc.textFile("file:///usr/local/spark/sparksqldata/")  # load the grade file of the third subject
  6. lines = lines1.union(lines2).union(lines3)  # merge the three files
  7. data = lines.map(lambda x: x.split(" ")).map(lambda x: (x[0], (int(x[1]), 1)))  # map each line to (name, (grade, 1))
  8. res = data.reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))  # total grade and number of subjects per student
  9. data = res.map(lambda x: (x[0], round(x[1][0] / x[1][1], 2)))  # total grade / number of subjects = average, kept to two decimal places
  10. data.repartition(1).saveAsTextFile("file:///usr/local/spark/sparksqldata/result1")  # write to a single output file

6. Summary of experience

Spark is a fast, memory-based unified analytics engine for large-scale data processing. It is fault-tolerant and parallel, it is developing rapidly, and its framework is more flexible and practical than Hadoop's: it reduces processing latency and improves performance, and it can also be combined with Hadoop in practice. The RDD (Resilient Distributed Dataset) is Spark's core data model, an abstract collection of elements that holds the data. The "resilient" part shows in how the data is stored: RDD data is kept in memory by default, and if it does not fit in memory, Spark automatically writes the RDD data to disk.
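A minimal sketch of this spill-to-disk behaviour (assuming a pyspark shell where the SparkContext sc already exists; the storage level is chosen here purely for illustration):

from pyspark import StorageLevel

rdd = sc.parallelize(range(1000000))
rdd.persist(StorageLevel.MEMORY_AND_DISK)   # keep partitions in memory, spill to disk if they do not fit
print(rdd.count())                          # the first action materializes and caches the RDD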

This final assignment deepened my understanding of pyspark. In the experiments the data is processed programmatically: first an RDD is created, then the map method splits each row of the record and extracts the required columns, and further methods produce the results, for example the count method to compute totals, the distinct method to remove duplicates, and the round method to keep a fixed number of decimal places. There are many other methods that I still need to keep learning in order to use them flexibly.
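For instance, the whole pattern can be exercised on a small in-memory sample (a toy sketch with made-up records, not part of the assignment data):

sample = sc.parallelize(["Tom,DataBase,80", "Tom,Python,90", "Jim,DataBase,70"])
rows = sample.map(lambda x: x.split(","))                  # split each record into columns
print(rows.map(lambda x: x[0]).distinct().count())         # distinct student names: 2
print(round(sum(rows.map(lambda x: int(x[2])).collect()) / rows.count(), 2))   # overall average score: 80.0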

Through this assignment I also found many gaps in my Spark and RDD programming: my understanding of RDDs is not deep enough, and there are still many shortcomings in my practical use of the code, so I need to keep studying hard in the future.

 

 

 
