Data Analysis Weapons: the Numpy, pandas, and Matplotlib Libraries
- 2.1 Numpy Basics
- 2.1.1 Numpy and Arrays
- 2.1.2 Differences between Numpy arrays and lists
- 2.1.3 Several ways to create an array
- 2.2 Fundamentals of pandas
- 2.2.1 Creation of a DataFrame, a two-dimensional data form
- 2.2.2 Reading and writing of files such as Excel (important)
- 2.2.3 Data reading and screening (important)
- 2.2.4 Data table splicing
- 2.3 Matplotlib Data Visualization Basics
- 2.3.1 Basic drawing
- 2.3.2 Common Tips for Data Visualization
- 2.4 Comprehensive Case Study - Stock Data Reading and K-Line Charting
- Library updates
- 2.4.1 Preliminary Attempts - Stock Data Reading and Visualization
- 2.4.2 Comprehensive Practice - Stock K-Line Charting
- 2.5 Curriculum-related resources
Python is powerful in part because it provides many efficient and convenient data analysis toolkits. This chapter explains three common weapons of data analysis: the Numpy, pandas, and Matplotlib libraries. The Numpy library is the basis of the pandas library, and both are mainly used to process one-dimensional and two-dimensional tabular data, while the Matplotlib library is used for data visualization. At the end of this chapter, we will work through a comprehensive case study: stock data reading and visual analysis.
This chapter covers a lot of ground. It builds on the author's first book, "Detailed Explanation of the Whole Process of Python Financial Big Data Mining and Analysis", and adds a good deal of new content. First-time readers may want to skim it for a general impression and come back to it when they meet unfamiliar points in later chapters. After completing this chapter, you will be able to use these data analysis weapons to draw the stock K-line chart shown in the figure below.
2.1 Numpy Basics
Before learning the pandas library, you first need to understand the Numpy library. Numpy is short for Numerical Python and is the basis of the pandas library. The Numpy knowledge we need to master is not complicated; it mainly paves the way for learning pandas later. If you installed Python through Anaconda, the Numpy library is already included and does not need to be installed separately.
2.1.1 Numpy and Arrays
The main feature of the Numpy library is that it introduces the concept of an array. Arrays are somewhat similar to the lists we learned earlier, so we use lists here for a first look at the basic idea. Start by importing the library. The import is usually written as import numpy as np, so that np can be used in place of numpy and the code is more concise:
import numpy as np
a = [1, 2, 3, 4]
b = np.array([1, 2, 3, 4])
print(a)
print(b)
print(type(a)) # Print the category of a
print(type(b)) # Print the category of b
Here np.array(list) is one way to create an array from a list. The output is as follows:
[1, 2, 3, 4] # List presentation
[1 2 3 4] # Array presentation
<class 'list'> # a's category is a list
<class 'numpy.ndarray'> # b's category is an array
Next, access the elements of the list and the array by index, with the following code:
print(a[1])
print(b[1])
print(a[0:2])
print(b[0:2])
The output is as follows:
2 # Result of indexing list a
2 # Result of indexing array b
[1, 2] # Result of slicing list a; list slices are left-closed, right-open
[1 2] # Result of slicing array b; array slices are also left-closed, right-open
The results show that lists and arrays share the same indexing mechanism. The only visible difference is that the elements of a printed array are separated by spaces, while a list uses commas.
2.1.2 Differences between Numpy arrays and lists
The analysis above shows that Numpy arrays and lists are very similar, so why does Python need a Numpy library at all? There are many reasons; here are the two main ones:
Arrays make it easier to do some math, while lists are more cumbersome;
Arrays can support multiple dimensions of data, whereas lists can usually only store one dimension of data.
On the first point, Numpy is a data processing library with good support for mathematical operations, whereas lists are more cumbersome. The following code demonstrates this:
c = a * 2
d = b * 2
print(c)
print(d)
The results of the run are as follows:
[1, 2, 3, 4, 1, 2, 3, 4]
[2 4 6 8]
You can see that with the same multiplication, the list repeats its elements, while the array performs the arithmetic on each element.
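To make the contrast concrete, here is a minimal sketch of what it takes to double every element of a list compared with an array. The variable names doubled_list, doubled_array, and summed are only illustrative and do not come from the original text.

```python
import numpy as np

a = [1, 2, 3, 4]
b = np.array(a)

# Doubling each element of a list needs an explicit loop or comprehension
doubled_list = [x * 2 for x in a]

# The array does the same thing with a single arithmetic expression
doubled_array = b * 2

# Arithmetic between two arrays of the same shape is also element-wise
summed = b + b

print(doubled_list)   # [2, 4, 6, 8]
print(doubled_array)  # [2 4 6 8]
print(summed)         # [2 4 6 8]
```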
On the second point, a list stores one-dimensional data, while an array can store multidimensional data. Some readers may wonder what one-dimensional and multidimensional data are. The idea is similar to geometry: one dimension is like a straight line, and multiple dimensions are like a plane (two-dimensional) or a solid (three-dimensional), and so on. A list is like a straight line and can be regarded as one-dimensional, while the table data we usually handle in Excel can be regarded as two-dimensional. Take the following code as an example:
e = [[1, 2], [3, 4], [5, 6]] # The elements of the list are small lists
f = np.array([[1, 2], [3, 4], [5, 6]]) # One way to create a two-dimensional array
The printout of list e and array f is shown below:
[[1, 2], [3, 4], [5,6]] # This is the printout of list e
[[1 2]
[3 4]
[5 6]]
You can see that the list contains three small lists but is still a one-dimensional structure, whereas the two-dimensional array is a structure of three rows and two columns. This is also central to the pandas library studied later, because data processing often works with two-dimensional arrays, that is, two-dimensional tables.
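The structural difference can also be checked in code. The sketch below uses the ndim and shape attributes of a Numpy array, which the original text does not mention, to show that the array knows it has rows and columns while the outer list only knows how many items it holds.

```python
import numpy as np

e = [[1, 2], [3, 4], [5, 6]]
f = np.array(e)

print(len(e))    # 3: the outer list only knows that it holds three items
print(f.ndim)    # 2: the array has two dimensions
print(f.shape)   # (3, 2): three rows and two columns
print(f[:, 0])   # [1 3 5]: the whole first column in one step
```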
2.1.3 Several ways to create an array
We have already seen one way to create an array, namely np.array(list). Here is a brief summary:
# Create a one-dimensional array
b = np.array([1, 2, 3, 4])
# Create a two-dimensional array
f = np.array([[1, 2], [3, 4], [5, 6]])
There are other common ways to create arrays. Taking one-dimensional arrays as an example, we can also use the np.arange() function, which accepts one, two, or three arguments:
# One argument: the value is the end point; the start defaults to 0 and the step to 1
x = np.arange(5)
# Two arguments: the first is the start and the second is the end; the step defaults to 1; left-closed, right-open
y = np.arange(5, 10)
# Three arguments: the first is the start, the second is the end, and the third is the step; left-closed, right-open
z = np.arange(5, 10, 0.5)
Print out the result as follows:
[0 1 2 3 4]
[5 6 7 8 9] # Like a list slice, this is also left-closed and right-open
[5. 5.5 6. 6.5 7. 7.5 8. 8.5 9. 9.5]
We can also create random one-dimensional arrays through the np.random module. For example, np.random.randn(3) creates a one-dimensional array of three random numbers drawn from the standard normal distribution (mean 0 and variance 1):
a = np.random.randn(3)
The printout is shown below:
[-0.02066164 0.42953796 1.17999329]
If you replace np.random.randn(3) with np.random.rand(3), you get three random numbers between 0 and 1 instead. This is used later in subsection 2.3.1 when demonstrating scatter plots.
To create a two-dimensional array, you can combine the np.arange() function for one-dimensional arrays with the reshape method. For example, the following code arranges the 12 integers from 0 to 11 into a two-dimensional array of 3 rows and 4 columns:
a = np.arange(12).reshape(3,4)
The printout is shown below:
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
A random two-dimensional array can be created as follows:
a = np.random.randint(0, 10, (4, 4))
print(a)
Here the np.random.randint() function creates random integers. The first argument, 0, is the start of the range, the second argument, 10, is the end of the range (not included), and the third argument, (4, 4), means that a two-dimensional array of 4 rows and 4 columns is generated. The result is as follows:
[[4 1 6 3]
[3 0 4 8]
[7 8 1 8]
[4 6 3 6]]
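Because these functions return different numbers on every run, the printed values above will not match yours. The sketch below is one way to check the documented ranges instead. Fixing the seed with np.random.seed(0) is an addition for reproducibility and is not part of the original text, and the variable names are illustrative.

```python
import numpy as np

np.random.seed(0)  # fix the seed so repeated runs give the same numbers

normal_values = np.random.randn(3)            # samples from the standard normal distribution
uniform_values = np.random.rand(3)            # samples from the interval [0, 1)
int_table = np.random.randint(0, 10, (4, 4))  # integers from 0 to 9, 4 rows by 4 columns

print(normal_values.shape)                                   # (3,)
print(uniform_values.min() >= 0, uniform_values.max() < 1)   # True True
print(int_table.shape)                                       # (4, 4)
print(int_table.min() >= 0, int_table.max() <= 9)            # True True
```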
2.2 Fundamentals of pandas
The pandas library is an open-source Python library built on Numpy. It is widely used for fast data analysis as well as data cleaning and preparation. In a way, you can think of pandas as the Python version of Excel. If you installed Python through Anaconda, the pandas library is already included and does not need to be installed separately.
Compared with Numpy, pandas is better at handling two-dimensional data. pandas has two main data structures: Series and DataFrame. A Series is similar to a one-dimensional array generated by Numpy; the difference is that a Series contains not only values but also a set of index labels. It is created as follows:
import pandas as pd
s1 = pd.Series(['Ding Yi', 'Wang Er', 'Zhang San'])
The result is shown below. It is also a one-dimensional data structure, and each element has a row index that can be used to locate it; for example, s1[1] locates the second element, "Wang Er".
0 Ding Yi
1 Wang Er
2 Zhang San
dtype: object
Series is rarely used on its own; pandas mainly works with the DataFrame data structure. A DataFrame is a two-dimensional table structure that can intuitively be regarded as an Excel sheet.
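As a small sketch of what "values plus an index" means, the code below looks at the two parts of a Series separately and then builds a second Series with custom labels. The labels 'a', 'b', and 'c' and the name s2 are hypothetical and chosen only for illustration.

```python
import pandas as pd

s1 = pd.Series(['Ding Yi', 'Wang Er', 'Zhang San'])
print(s1.index.tolist())   # [0, 1, 2]: the default numeric index
print(s1.values.tolist())  # ['Ding Yi', 'Wang Er', 'Zhang San']

# The index does not have to be numeric (labels here are made up for the demo)
s2 = pd.Series(['Ding Yi', 'Wang Er', 'Zhang San'], index=['a', 'b', 'c'])
print(s2['b'])             # Wang Er
```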
2.2.1 Creation of a DataFrame, a two-dimensional data form
There are three common ways to create a DataFrame: through a list, through a dictionary and through a two-dimensional array.
(1) Creating a DataFrame from a list
Creating a DataFrame from a list is similar to creating a two-dimensional array with Numpy in the previous subsection. The pandas library is usually imported with import pandas as pd, after which the DataFrame function is called to create a two-dimensional table:
import pandas as pd
a = pd.DataFrame([[1, 2], [3, 4], [5, 6]])
Printing a gives the following result:
0 1
0 1 2
1 3 4
2 5 6
A comparison is made with the two-dimensional array previously generated via Numpy:
[[1 2]
[3 4]
[5 6]]
You can see that the two-dimensional table generated by the pandas DataFrame function looks more like the tabular data we see in Excel. It has both a row index and a column index, and the index numbers start from 0.
We can also customize its column index and row index name with the following code:
a = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['date', 'score'], index=['A', 'B', 'C'])
Here columns gives the column index names and index gives the row index names. The output is as follows:
date score
A 1 2
B 3 4
C 5 6
A DataFrame can also be built from lists column by column, as in the following demo code:
a = pd.DataFrame() # Create an empty DataFrame
date = [1, 3, 5]
score = [2, 4, 6]
a['date'] = date
a['score'] = score
Make sure that the date list and the score list have the same length, otherwise an error is raised. The result is as follows:
date score
0 1 2
1 3 4
2 5 6
This method is applied in subsection 5.2.2 of Chapter 5 when summarizing the names of the feature variables and their importances.
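The length requirement mentioned above can be seen in a short sketch. It deliberately assigns a list that is one element short and catches the resulting ValueError; the variable message is only there for the demonstration.

```python
import pandas as pd

a = pd.DataFrame()
a['date'] = [1, 3, 5]

message = None
try:
    a['score'] = [2, 4]  # deliberately one element short
except ValueError as err:
    message = str(err)
    print('ValueError:', message)

# With a list of the right length the assignment succeeds
a['score'] = [2, 4, 6]
print(a.shape)  # (3, 2)
```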
(2) Creating a DataFrame from a Dictionary
In addition to creating a DataFrame from a list, you can also create one from a dictionary and customize the column index names: the dictionary keys become the column index. The code is as follows:
# Create a two-dimensional table with pandas: dictionary method
b = pd.DataFrame({'a': [1, 3, 5], 'b': [2, 4, 6]}, index=['x', 'y', 'z'])
print(b) # In the Jupyter Notebook editor, you can also enter b directly to view it
The output is as follows. You can see that the column index is now the key names of the dictionary.
a b
x 1 2
y 3 4
z 5 6
If you want the dictionary keys to become the row index instead, you can convert the dictionary into a DataFrame with from_dict and set the orient parameter to index, as shown in the following code:
c = pd.DataFrame.from_dict({'a': [1, 3, 5], 'b': [2, 4, 6]}, orient="index")
print(c)
Here the orient parameter specifies the direction that the dictionary keys correspond to. The default value is columns; unless it is set to index, the dictionary keys remain the column index. The output is as follows, and the dictionary keys are now the row index.
0 1 2
a 1 3 5
b 2 4 6
Besides setting the orient parameter of the from_dict() function, we can also transpose the table through the .T attribute of the DataFrame. The demo code is as follows:
b = pd.DataFrame({'a': [1, 3, 5], 'b': [2, 4, 6]})
print(b) # Print the original form
print(b.T) # Print the transposed form
The results are as follows. You can see that the table can also be transposed through .T. Note that if you want to change the original table, you need to reassign it, written as b = b.T, so that the structure of the original b is changed.
a b
0 1 2
1 3 4
2 5 6
0 1 2
a 1 3 5
b 2 4 6
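The two routes described above, from_dict with orient="index" and the .T attribute, should lead to the same table. The following sketch compares them; the names d, by_orient, and by_transpose are illustrative.

```python
import pandas as pd

d = {'a': [1, 3, 5], 'b': [2, 4, 6]}

by_orient = pd.DataFrame.from_dict(d, orient='index')  # keys become the row index
by_transpose = pd.DataFrame(d).T                       # build by columns, then transpose

print(by_orient.index.tolist())     # ['a', 'b']
print(by_transpose.index.tolist())  # ['a', 'b']
print(by_orient.values.tolist() == by_transpose.values.tolist())  # True
```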
(3) Creating a DataFrame from a two-dimensional array
A DataFrame can also be created from a two-dimensional array generated by Numpy. Taking the two-dimensional array from subsection 2.1.3 as an example, the following code generates a DataFrame of 3 rows and 4 columns:
import numpy as np
d = pd.DataFrame(np.arange(12).reshape(3,4), index=[1, 2, 3], columns=['A', 'B', 'C', 'D'])
The printout is shown below:
A B C D
1 0 1 2 3
2 4 5 6 7
3 8 9 10 11
Supplemental Knowledge: Modifying Row Index or Column Index Names
Sometimes we want to change row index or column index names that were set earlier. This can be done with the rename() function. First, construct a demo DataFrame using the points above:
a = pd.DataFrame([[1, 2], [3, 4]], columns=['date', 'score'], index=['A', 'B'])
At this point, table a is shown below:
date score
A 1 2
B 3 4
If you want to rename an index, the rename() function is used as follows:
a = a.rename(index={'A':'Ali', 'B':'Tencent'}, columns={'date':'Date','score':'Score'})
Note that rename does not change the original table. You need to assign the result back to a to change it, or set the inplace parameter to True in rename() to modify the table in place, with the following code:
a.rename(index={'A':'Ali', 'B':'Tencent'}, columns={'date':'Date','score':'Score'}, inplace=True)
At this point, table a is shown below:
Date Score
Ali 1 2
Tencent 3 4
The index value at this point can also be viewed through the values property with the following code:
a.index.values
The printout is as follows; it is a one-dimensional Numpy array.
['Ali' 'Tencent']
If you want to give the row index itself a name, you can do so with the following code:
a.index.name = 'Company'
The result is as follows:
Date Score
Company
Ali 1 2
Tencent 3 4
If you want to use the contents of a column as the row index, you can use the set_index() function with the following code:
a = a.set_index('Date')
At this point, table a is shown below:
Score
Date
1 2
3 4
As with the rename() function, we can set the inplace parameter to True so that no reassignment is needed:
a.set_index('Date', inplace=True)
If you then want to go back to a numeric row index, you can use the reset_index() function with the following code:
a = a.reset_index()
At this point, table a becomes a table with numbers as row indexes:
Date Score
0 1 2
1 3 4
As a side note, if you simply want to change the column names, you can assign a new list directly with a.columns = ['xxx', 'xxx'] (the column names of a DataFrame can be viewed by printing its columns attribute). The demo code is below:
a = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['date', 'score'], index=['A', 'B', 'C'])
a.columns = ['Date', 'Score']
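The index operations in this supplement can be chained into one short round trip. The sketch below keeps every intermediate result in its own variable (renamed, indexed, and restored are illustrative names) so that it is easy to see that none of the calls changes the table it was called on.

```python
import pandas as pd

a = pd.DataFrame([[1, 2], [3, 4]], columns=['date', 'score'], index=['A', 'B'])

renamed = a.rename(columns={'date': 'Date', 'score': 'Score'})
print(a.columns.tolist())         # ['date', 'score']: the original is unchanged
print(renamed.columns.tolist())   # ['Date', 'Score']

indexed = renamed.set_index('Date')   # the Date column becomes the row index
print(indexed.index.name)             # Date
print(indexed.columns.tolist())       # ['Score']

restored = indexed.reset_index()      # back to a numeric row index
print(restored.columns.tolist())      # ['Date', 'Score']
print(restored.index.tolist())        # [0, 1]
```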
2.2.2 Reading and writing of files such as Excel (important)
With pandas you can read data from a wide variety of file types and write data back to them. This section explains reading and writing with Excel and CSV files as examples.
(1) File reading
Enter the following code for reading Excel data:
import pandas as pd
data = pd.read_excel('') # data is a DataFrame structure
Here the Excel file has the suffix xlsx; for Excel 2003 or earlier the suffix is xls. The file path used here is a relative path, that is, relative to the folder where the code is located. It can also be set to an absolute path (for relative and absolute paths, see the supplementary knowledge at the end of this subsection).
Printing data shows the whole table, or you can use data.head() to view the first five rows (written as data.head(10), it shows the first 10 rows). The code is as follows:
data.head()
The printout is as follows:
read_excel also accepts parameters, which are used as follows:
data = pd.read_excel('', sheet_name=0)
A few common parameters are as follows. sheet_name: specifies the worksheet, either by sheet name or by number (the default is 0, that is, the first sheet); index_col: sets a column as the row index.
In addition to Excel files, pandas can also read CSV files. A CSV file is also a format for storing data. Compared with an Excel file, a CSV file is essentially a text file that stores data without formatting, formulas, macros, and so on, so it usually takes up less space. A CSV file is a series of comma-separated values and can be opened either with Excel or with a text editor such as Notepad.
Enter the following code for reading CSV file:
data = pd.read_csv('')
read_csv can also specify parameters, which are used as follows:
data = pd.read_csv('', delimiter=',', encoding='utf-8')
Here the delimiter parameter specifies the separator used in the CSV file, which defaults to a comma. encoding sets the encoding; if Chinese text appears garbled, set it to utf-8 or gbk. You can also set the index column with index_col.
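To try these parameters without a file on disk, the sketch below feeds read_csv an in-memory text buffer. The CSV content, the semicolon separator, and the column names are all made up for the demonstration.

```python
import io
import pandas as pd

# Hypothetical CSV content that uses a semicolon as the separator
csv_text = "company;score\nAli;90\nTencent;85\n"

# delimiter tells read_csv how values are separated;
# index_col=0 turns the first column into the row index
data = pd.read_csv(io.StringIO(csv_text), delimiter=';', index_col=0)

print(data.index.tolist())        # ['Ali', 'Tencent']
print(data.loc['Ali', 'score'])   # 90
```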
(2) File Write
With the following code, you can write data to an Excel file.
# First create a DataFrame
data = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['Column A','Column B'])
# Write the DataFrame to an Excel file
data.to_excel('data_new.xlsx')
After running, an Excel file named data_new is generated in the folder where the code is located, as shown below:
The storage path used here is also a relative path, so the file is saved in the folder where the code is located. You can also write an absolute path as needed; see the supplementary knowledge below on relative and absolute paths.
In the file above, the first column of the saved Excel sheet retains the index information. To leave it out, set the index parameter of to_excel to False. The common parameters of to_excel are as follows. sheet_name: the name of the worksheet; index: True or False, where the default True writes the index values as the first column of the file and False omits them; columns: selects the columns to write; encoding: the encoding (accepted by older pandas versions; this parameter was removed from to_excel in pandas 2.0).
For example to import a data table into an Excel file and ignore the row index information, the code is as follows:
data.to_excel('data_new.xlsx', index=False)
In a similar way, you can write data to a CSV file with the following code:
data.to_csv('data_new.csv')
Like to_excel, to_csv can also set index, columns, encoding, and other parameters. Note that if Chinese text in the exported CSV file is garbled even though encoding is set to "utf-8", set the encoding parameter to "utf_8_sig" instead, with the following code:
data.to_csv('demo.csv', index=False, encoding="utf_8_sig")
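A quick way to check what to_csv wrote is to read the file back. The sketch below writes into a temporary folder so that nothing is left behind; the use of tempfile and the names tmp_dir, path, and loaded are assumptions of this example rather than part of the original workflow.

```python
import os
import tempfile
import pandas as pd

data = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['Column A', 'Column B'])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo.csv')
    # index=False leaves the row index out of the file
    data.to_csv(path, index=False, encoding='utf_8_sig')
    # Read with the same encoding so the leading byte order mark is handled
    loaded = pd.read_csv(path, encoding='utf_8_sig')

print(loaded.columns.tolist())                           # ['Column A', 'Column B']
print(loaded.values.tolist() == data.values.tolist())    # True
```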
Supplementary Knowledge: Relative and Absolute Paths to Files
Relative path
A relative path is given relative to the folder where the code is located. For example, data.to_excel('') in the case above generates the Excel file in the folder where the code is located. If you write data.to_excel('XX folder/'), the Excel file is generated in "XX folder" under the folder where the code is located.
Absolute path
An absolute path is the full path name of a file, for example 'E:\Big Data Analytics\'. Because the backslash "\" often has a special meaning in Python (for example, "\n" indicates a line break), it is usually recommended to write two backslashes in an absolute path to cancel any special meaning of a single backslash, that is, 'E:\\Big Data Analytics\\'.
Besides using two backslashes, you can also put an r in front of the path string to cancel the special meaning of a single backslash, with the following code:
data.to_excel('E:\\Big Data Analytics\\') # Recommended way to write absolute paths 1
data.to_excel(r'E:\Big Data Analytics\') # Recommended way to write absolute paths 2
data.to_excel('E:/Big Data Analytics/') # Recommended way to write absolute paths 3
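The sketch below checks that the double-backslash form and the r-prefixed form describe the same string, and builds a relative path with pathlib as one possible alternative to writing separators by hand. The file name data_new.xlsx is reused from the earlier example purely for illustration, and pathlib is not mentioned in the original text.

```python
from pathlib import Path

# Escaped backslashes and a raw string produce the same path text
escaped = 'E:\\Big Data Analytics\\data_new.xlsx'
raw = r'E:\Big Data Analytics\data_new.xlsx'
print(escaped == raw)   # True

# A relative path: "XX folder" under the folder the code runs from
relative = Path('XX folder') / 'data_new.xlsx'
print(relative.is_absolute())   # False
print(relative.name)            # data_new.xlsx
```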
2.2.3 Data reading and screening (important)
After creating a DataFrame, you need to read and filter the data in the table. This subsection explains the various ways of reading and filtering data.
First, create a table of three rows and three columns, with the row index set to r1, r2, and r3 and the column index set to c1, c2, and c3, to use as the example for data reading and filtering:
import pandas as pd
data = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], index=['r1', 'r2', 'r3'], columns=['c1', 'c2', 'c3'])
Alternatively, use the array-based method from subsection 2.2.1 to create the DataFrame, with 1 as the start and 10 as the end (the end point is not included) to generate the nine numbers 1 to 9:
data = pd.DataFrame(np.arange(1,10).reshape(3,3), index=['r1', 'r2', 'r3'], columns=['c1', 'c2', 'c3'])
The printouts for both methods are shown below:
c1 c2 c3
r1 1 2 3
r2 4 5 6
r3 7 8 9
The following covers filtering data by rows and columns, filtering by specific conditions, viewing the data as a whole, and data operations, sorting, and deletion.
1. Data filtering by rows and columns
(1) Selection of data by column
The following code allows you to select data by columns, here a single column is selected first.
a = data['c1']
The printout is shown below:
r1 1
r2 4
r3 7
Name: c1, dtype: int64
The result no longer carries the table header, because selecting one column with data['c1'] returns a one-dimensional Series. To get two-dimensional table data back, use the following code:
b = data[['c1']]
The printout is shown below:
c1
r1 1
r2 4
r3 7
To select multiple columns, give a list inside the square brackets []. For example, to read columns c1 and c3, write data[['c1', 'c3']]. Note in particular that it must be a list and cannot be data['c1', 'c3']:
c = data[['c1', 'c3']]
The printout is shown below:
c1 c3
r1 1 3
r2 4 6
r3 7 9
(2) Selection of data by row
Rows can be selected by row position, with the following code:
# Select rows 2 and 3; note that positions start at 0 and the slice is left-closed, right-open
a = data[1:3]
The printout is shown below:
c1 c2 c3
r2 4 5 6
r3 7 8 9
pandas recommends the iloc method for selecting rows by position. It is more intuitive and avoids the confusion that data[1:3] may cause:
b = data.iloc[1:3]
To select a single row you have to use iloc. For example, to select the last row:
c = data.iloc[-1]
Writing data[-1] here raises an error, because pandas treats -1 as a column name.
Besides selecting by row position, you can also use the loc method to select by row name, with the following code:
d = data.loc[['r2', 'r3']]
When the table has many rows, you can use the head() method to select the first 5 rows:
e = data.head()
Since only 3 rows were created here, data.head() returns all of them. To get only the first two rows, write data.head(2).
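The row selection methods above can be compared side by side. The sketch below rebuilds the demo table so that it runs on its own, and it catches the KeyError raised by data[-1] instead of letting the script stop; the flag key_error_raised is only there for the demonstration.

```python
import pandas as pd

data = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                    index=['r1', 'r2', 'r3'], columns=['c1', 'c2', 'c3'])

last_row = data.iloc[-1]          # the last row, returned as a Series
print(last_row.tolist())          # [7, 8, 9]

# Selecting by position and by name gives the same two rows
same_rows = data.iloc[1:3].equals(data.loc[['r2', 'r3']])
print(same_rows)                  # True

key_error_raised = False
try:
    data[-1]                      # looked up as a column name, which does not exist
except KeyError:
    key_error_raised = True
print(key_error_raised)           # True

print(data.head(2).index.tolist())  # ['r1', 'r2']
```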
(3) Selection by block
To select certain rows of certain columns, for example the first two rows of columns c1 and c3, use the following code:
a = data[['c1', 'c3']][0:2] # can also be written as data[0:2][['c1', 'c3']]
This simply combines the row and column selection described above. The result is as follows:
c1 c3
r1 1 3
r2 4 6
In practice, iloc is often combined with column selection to pick specific blocks or values, with the following code:
b = data.iloc[0:2][['c1', 'c3']]
The effect is the same, the logic is clear, and the code is hard to misread: first select rows with iloc, then select columns by name. This is also the usage recommended by the official pandas documentation.
This approach is even more useful for picking a single value. For example, to get the value in the first row of column c3, you should not write data['c3'][0] or data[0]['c3']. The following form is clearer: iloc[0] picks the first row, and then 'c3' picks the column.
c = data.iloc[0]['c3']
data.iloc[0, 1] # The cell in the first row, second column
It is also possible to select both rows and columns at once with the loc and iloc methods, with the following code:
d = data.loc[['r1', 'r2'], ['c1', 'c3']]
e = data.iloc[0:2, [0, 2]]
Note that the loc method selects rows and columns by label (here, strings), and the iloc method selects them by integer position. A way to remember this: loc is short for location, so it locates by label, while iloc has an extra letter i, which stands for integer, so iloc uses integer positions. Both lines give the printout below:
c1 c3
r1 1 3
r2 4 6
Older versions of pandas also had an ix method for selecting a region. It could select rows and columns at the same time, and unlike loc or iloc it accepted a mix of labels and positions:
f = data.ix[0:2, ['c1', 'c3']]
The logic and effect are the same as data.iloc[0:2][['c1', 'c3']]. However, ix was deprecated in pandas 0.20 and removed in pandas 1.0, so the line above no longer runs on current versions; for region selection, use the data.iloc[0:2][['c1', 'c3']] form instead.
As an additional point, iloc[0, 1] refers to the cell in the first row, second column.
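As a check on the label versus position distinction, the sketch below selects the same block with loc and with iloc and then reads one cell both ways. The table is rebuilt so that the example runs on its own, and the names by_label and by_position are illustrative.

```python
import pandas as pd

data = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                    index=['r1', 'r2', 'r3'], columns=['c1', 'c2', 'c3'])

by_label = data.loc[['r1', 'r2'], ['c1', 'c3']]   # rows and columns by name
by_position = data.iloc[0:2, [0, 2]]              # rows and columns by integer position
print(by_label.equals(by_position))               # True

print(data.iloc[0, 1])        # 2: first row, second column, by position
print(data.loc['r1', 'c2'])   # 2: the same cell, by label
print(data.iloc[0]['c3'])     # 3: first row, then column c3
```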
2. Filtering by specific conditions
Rows can also be filtered by a condition inside the square brackets. For example, to select rows where the value in column c1 is greater than 1:
a = data[data['c1'] > 1]
The printout is shown below:
c1 c2 c3
r2 4 5 6
r3 7 8 9
When there are several conditions, connect them with "&" (and) or "|" (or). For example, the following selects rows where c1 is greater than 1 and c2 is less than 8; remember to put parentheses around each condition.
b = data[(data['c1'] > 1) & (data['c2'] < 8)]
The printout is shown below:
c1 c2 c3
r2 4 5 6
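The text above demonstrates "&"; the sketch below does the same for "|" and also shows that a condition is just a column of True and False values that can be stored in a variable first. The thresholds and the names mask and either are chosen only for this example.

```python
import pandas as pd

data = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                    index=['r1', 'r2', 'r3'], columns=['c1', 'c2', 'c3'])

# A condition evaluates to a boolean Series, one value per row
mask = data['c1'] > 1
print(mask.tolist())             # [False, True, True]

# "or": keep rows where c1 is greater than 4 or c2 is less than 5
either = data[(data['c1'] > 4) | (data['c2'] < 5)]
print(either.index.tolist())     # ['r1', 'r3']
```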
3. Overall view of data
The shape attribute gives the number of rows and columns of the whole table, which is a quick way to grasp the size of a table that holds a lot of data.
data.shape
The result is as follows, where the first number is the number of rows in the table and the second number is the number of columns in the table.
(3, 3)
The describe() function quickly shows, for each column, the count, mean, standard deviation, minimum, 25th percentile, 50th percentile, 75th percentile, and maximum:
data.describe()
The running effect is as follows:
The value_counts() function quickly shows which values a column contains and how often each occurs:
data['c1'].value_counts()
The result is as follows, you can see that there are 3 different kinds of data in column c1, and the frequency of each occurrence is 1.
7 1
1 1
4 1
Name: c1, dtype: int64
These points will have application when viewing movie rating data in section 14.3.2.
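Since describe() returns an ordinary DataFrame, individual statistics can be read out of it with loc. The sketch below does this for the three-by-three demo table, which is rebuilt here so that the example runs on its own; the names summary and counts are illustrative.

```python
import pandas as pd

data = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                    index=['r1', 'r2', 'r3'], columns=['c1', 'c2', 'c3'])

print(data.shape)   # (3, 3)

summary = data.describe()
print(summary.index.tolist())
# ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
print(summary.loc['mean', 'c1'])   # 4.0
print(summary.loc['std', 'c1'])    # 3.0

counts = data['c1'].value_counts()
print(counts.loc[4])               # 1: the value 4 appears once in column c1
```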
4. Data arithmetic, sorting and deletion
(1) Data operations
A new column can be created from existing columns through arithmetic, with the following code:
data['c4'] = data['c3'] - data['c1']
data.head()
The output is shown below. data.head() outputs the first 5 rows of the table; there are only 3 rows here, so all of them are displayed.
c1 c2 c3 c4
r1 1 2 3 2
r2 4 5 6 2
r3 7 8 9 2
(2) Sorting of data
With sort_values() you can sort the data according to the columns, for example to sort column c2 in descending order, the code is as follows:
a = data.sort_values(by='c2', ascending=False)
Here the by parameter specifies the column to sort by. The ascending parameter controls the direction: it defaults to True (ascending), and setting it to False sorts in descending order. The printout is shown below.
c1 c2 c3 c4
r3 7 8 9 2
r2 4 5 6 2
r1 1 2 3 2
When sorting by a column, you can also omit "by=" and write the code as follows, with the same effect:
a = data.sort_values('c2', ascending=False)
To sort by multiple columns, pass a list of column names to the by parameter. For example, sorting by c2 and then c1 in descending order means that the rows are sorted by c2 first, and rows with duplicate values in c2 are then sorted by c1 in descending order.
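A minimal sketch of multi-column sorting follows. Because the demo table in this section has no duplicate values in c2, the example uses a small hypothetical table with a tie so that the second sort column actually matters; passing a list to ascending to mix directions is an addition not covered in the text.

```python
import pandas as pd

# Hypothetical table with a tie in column c2
data = pd.DataFrame({'c1': [1, 4, 7], 'c2': [8, 5, 8]}, index=['r1', 'r2', 'r3'])

# Sort by c2 descending, then break ties by c1 descending
a = data.sort_values(by=['c2', 'c1'], ascending=False)
print(a.index.tolist())   # ['r3', 'r1', 'r2']

# ascending also accepts one value per column: c2 descending, c1 ascending
b = data.sort_values(by=['c2', 'c1'], ascending=[False, True])
print(b.index.tolist())   # ['r1', 'r3', 'r2']
```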
In addition, sort_index() sorts by row index. For example, to sort by row index in ascending order, the code is as follows:
a = a.sort_index()
After running the code, the row index of the table just generated is back in ascending order: r1, r2, r3. Setting the ascending parameter to False sorts it in descending order instead.
This technique is applied at the end of subsection 5.2.2 in Chapter 5, when the feature importances of the variables are viewed in sorted order.
(3) Data deletion
To delete specified data from a table, use the drop() function. Its usage is as follows:
DataFrame.drop(index=None, columns=None, inplace=False)
Its common parameters are as follows. index: the rows to delete. columns: the columns to delete. inplace: defaults to False, in which case the deletion does not change the original data but returns a new DataFrame with the deletion applied; with inplace=True the deletion is performed directly on the original data.
For example, to delete the data in column c1, the code is as follows:
a = data.drop(columns='c1')
To delete multiple columns of data, such as columns c1 and c3, you can declare the columns to be deleted by listing them, with the following code:
b = data.drop(columns=['c1', 'c3'])
If you want to delete rows of data, such as deleting the first and third rows of data, the code is as follows:
c = data.drop(index=['r1', 'r3'])
Note that the row index labels must be given here, not positional numbers; if the row index labels are themselves numbers, enter those numbers. The deletions above assign the result to a new variable and leave the original table unchanged. To change the original table itself, set the inplace parameter to True. The code is as follows:
data.drop(index=['r1','r3'], inplace=True)
This knowledge is applied in Section 11.2, on data preprocessing, when handling duplicate values and related content.
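The difference between the default behavior and inplace=True can be checked directly. The sketch below uses an assumed stand-in for the chapter's example table; shape_before is an illustrative variable name.

```python
import pandas as pd

# Assumed stand-in for the chapter's example table
data = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    index=['r1', 'r2', 'r3'],
    columns=['c1', 'c2', 'c3'],
)

a = data.drop(columns='c1')   # returns a new table; data itself is untouched
shape_before = data.shape
print(shape_before, a.shape)  # (3, 3) (3, 2)

data.drop(index=['r1', 'r3'], inplace=True)   # modifies data itself and returns None
print(data.index.tolist())                    # ['r2']
```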
2.2.4 Data table splicing
pandas also provides more advanced features for merging and reshaping data, which make it very convenient to splice two tables together. The main methods are merge(), concat(), and append(); each is introduced below with a simple example.
Suppose you have the following two DataFrame tables and need to merge them:
import pandas as pd
df1 = pd.DataFrame({'Company': ['Vanke', 'Ali', 'Baidu'], 'Score': [90, 95, 85]})
df2 = pd.DataFrame({'Company': ['Vanke', 'Ali', 'Jingdong'], 'Share price': [20, 180, 30]})
df1 and df2 are shown below. The scores are the ratings that analysts have assigned to each company.
(1) merge() function
The merge() function joins the rows of different tables on one or more keys. An example is as follows:
df3 = pd.merge(df1, df2)
The running effect is as follows:
As you can see, merge() automatically merges on the column name the two tables share (the "Company" column) and by default keeps only the rows whose key appears in both tables (Vanke, Ali). If the tables share more than one column name, the on parameter specifies which column to merge on. The code is as follows:
df3 = pd.merge(df1, df2, on='Company')
The default merge takes the intersection (inner join), i.e., only what is common to both tables. To take the union (outer join), i.e., keep everything in both tables, set the how parameter. The code is as follows:
df3 = pd.merge(df1, df2, how='outer')
The result is shown below: all the data is present, and cells with no corresponding value in the original tables are filled with the null value NaN.
If you want to keep the entire contents of the left table and don't care much about the right table, you can set the how parameter to left:
df3 = pd.merge(df1, df2, how='left')
At this point, df3 is shown in the table below, and the contents of df1 (Vanke, Ali, Baidu) are kept intact.
Similarly, if you want to keep the entire contents of the right table and don't care much about the left table, you can set the how parameter to right.
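The four values of the how parameter can be compared side by side. The sketch below rebuilds the two example tables and counts the rows each join keeps; the dictionary rows is an illustrative name, not part of pandas.

```python
import pandas as pd

df1 = pd.DataFrame({'Company': ['Vanke', 'Ali', 'Baidu'], 'Score': [90, 95, 85]})
df2 = pd.DataFrame({'Company': ['Vanke', 'Ali', 'Jingdong'], 'Share price': [20, 180, 30]})

rows = {}
for how in ['inner', 'outer', 'left', 'right']:
    merged = pd.merge(df1, df2, on='Company', how=how)
    rows[how] = len(merged)   # number of rows kept by this join type
    print(how, sorted(merged['Company']))

print(rows)   # {'inner': 2, 'outer': 4, 'left': 3, 'right': 3}
```

The inner join keeps only the two shared companies, the outer join keeps all four, and the left and right joins keep the three companies of df1 and df2 respectively.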
If you want to merge based on row index, you can do so by setting the left_index and right_index parameters with the following code:
df3 = pd.merge(df1, df2, left_index=True, right_index=True)
At this point df3 is shown in the following table, and the two tables have been merged according to their row indexes.
Note that when splicing with the join() function, the two tables must not contain columns with the same name. If they do, set the lsuffix parameter (the suffix added to the overlapping column names of the left table; l stands for left) and the rsuffix parameter (the suffix for the right table; r stands for right). If there are no overlapping column names, you can write df1.join(df2) directly, which is somewhat more concise than merge().
In practice it is enough to remember merge(); join() is explained here so that you can read code written by others who use it. Table merging is applied in subsection 14.3.3.
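For illustration, a minimal sketch of join() on the two example tables follows. Both contain a "Company" column, so suffixes are required; the suffix strings '_x' and '_y' are arbitrary choices made for this example.

```python
import pandas as pd

df1 = pd.DataFrame({'Company': ['Vanke', 'Ali', 'Baidu'], 'Score': [90, 95, 85]})
df2 = pd.DataFrame({'Company': ['Vanke', 'Ali', 'Jingdong'], 'Share price': [20, 180, 30]})

# join() aligns on the row index; the overlapping column name gets a suffix on each side
df3 = df1.join(df2, lsuffix='_x', rsuffix='_y')
print(df3.columns.tolist())   # ['Company_x', 'Score', 'Company_y', 'Share price']
```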
(2) concat() function
The concat() function performs a full concatenation (similar to UNION ALL). It does not align on keys but splices the tables together directly, so the two tables need not share any columns or index values. concat() therefore has no how or on parameters; the direction of concatenation is specified through the axis parameter.
By default, axis=0, joins in row direction.
df3 = pd.concat([df1,df2], axis=0)
The results are as follows:
The row index here is made up of the original indexes of the two tables. To renumber it, use the reset_index() method covered in subsection 6.2.1, or set ignore_index=True in concat() to discard the original indexes and number the rows afresh.
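The effect of ignore_index can be seen by comparing the row indexes of the two results. This is a small sketch on the example tables; the name df4 is illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'Company': ['Vanke', 'Ali', 'Baidu'], 'Score': [90, 95, 85]})
df2 = pd.DataFrame({'Company': ['Vanke', 'Ali', 'Jingdong'], 'Share price': [20, 180, 30]})

df3 = pd.concat([df1, df2], axis=0)                     # keeps each table's own index
print(df3.index.tolist())                               # [0, 1, 2, 0, 1, 2]

df4 = pd.concat([df1, df2], axis=0, ignore_index=True)  # renumbers the rows
print(df4.index.tolist())                               # [0, 1, 2, 3, 4, 5]
```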
If you want to connect in column direction, you can set the axis parameter to 1.
df3 = pd.concat([df1,df2],axis=1)
The results are as follows:
(3) append() function (important)
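The append() function can be regarded as a simplified version of concat(); its effect is similar to pd.concat([df1, df2]). Note that DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0, so the code in this subsection only runs on older pandas versions. The code is as follows: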
df3 = df1.append(df2)
The effect is the same as that of pd.concat([df1, df2]).
Another common use of append(), like the append() method of a list, is adding a new element. The code is as follows:
df3 = df1.append({'Company': 'Tencent', 'Score': '90'}, ignore_index=True)
Be sure to set ignore_index=True here to ignore the original index; otherwise an error is raised. The resulting df3 is shown below:
Points related to data table splicing will be useful in Section 14.3.2 Movie Data Splicing.
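On pandas 2.0 and later, where DataFrame.append() is no longer available, the same row addition can be written with pd.concat(). The following is a sketch of one possible replacement, not the book's original code; new_row is an illustrative name, and the score is given as the integer 90 here so that the Score column stays numeric.

```python
import pandas as pd

df1 = pd.DataFrame({'Company': ['Vanke', 'Ali', 'Baidu'], 'Score': [90, 95, 85]})

# Wrap the new record in a one-row DataFrame, then concatenate
new_row = pd.DataFrame([{'Company': 'Tencent', 'Score': 90}])
df3 = pd.concat([df1, new_row], ignore_index=True)
print(df3)
```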
Beyond the points above, Section 11.2 explains how to handle duplicate values, missing values, and outliers with pandas; subsection 14.3.3 explains the pivot_table() function; and the supplementary knowledge in Section 14.3 covers the groupby() function.
2.3 Matplotlib Data Visualization Basics
In Python, there is an excellent data visualization library called Matplotlib. If you installed Python through Anaconda, it comes with the library, so you don't need to install it separately. This section explains the basic usage of Matplotlib and some common tips.
2.3.1 Basic drawing
Before drawing, import the Matplotlib library. The import is usually written as "import matplotlib.pyplot as plt", where "as" gives the library a short alias so that later you only need to call the corresponding plt function: plt.plot() for line graphs, plt.bar() for bar charts, plt.scatter() for scatter plots, plt.pie() for pie charts, plt.hist() for histograms, and so on. Line graphs, bar charts, scatter plots, and histograms are used below as examples of basic plotting.
(1) Line graphs
With plt.plot() you can draw a line graph. The code is as follows:
import matplotlib.pyplot as plt
x = [1, 2, 3]
y = [2, 4, 6]
plt.plot(x, y) # Plot line graphs
plt.show() # Show graphics
Remember to add plt.show() at the end to display the graph. The result is shown below:
If x and y are related by a mathematical formula, lists are awkward because they do not support element-wise arithmetic. In that case you can use the one-dimensional Numpy arrays described in subsection 2.1.2. The code is as follows:
import numpy as np
import matplotlib.pyplot as plt
x1 = np.array([1, 2, 3])
# First line: y= x + 1
y1 = x1 + 1
plt.plot(x1, y1) # Drawing with default parameters
# Second line: y= x*2
y2 = x1*2
# color sets the color, linewidth sets the line width in points, linestyle defaults to a solid line, and '--' draws a dashed line
plt.plot(x1, y2, color='red', linewidth=3, linestyle='--')
plt.show()
Here the one-dimensional array x1 is created with Numpy, and y1 and y2 are computed from it with array arithmetic. Both lines are drawn on the same graph, and the result is shown below:
(2) Bar charts
With plt.bar() you can draw a bar chart. The code is as follows:
import matplotlib.pyplot as plt
x = [1, 2, 3, 4, 5]
y = [5, 4, 3, 2, 1]
plt.bar(x, y)
plt.show()
The results of the run are shown below:
(3) Scatterplot
A scatter plot can be drawn with plt.scatter(). The code is as follows:
import matplotlib.pyplot as plt
import numpy as np
x = np.random.rand(10)
y = np.random.rand(10)
plt.scatter(x, y)
plt.show()
Here np.random.rand(10), mentioned in subsection 2.1.3, generates 10 random numbers between 0 and 1. The result is shown below:
(4) Histogram
plt.hist() draws a histogram of counts or of relative frequencies. The horizontal axis shows the data values, and the vertical axis shows how many times (or how frequently) values fall in each bin. The demonstration code is as follows:
import matplotlib.pyplot as plt
import numpy as np
# Generate 10000 normally distributed random numbers
data = np.random.randn(10000)
# Plot the histogram; bins is the granularity, i.e. the number of bars, and edgecolor is the color of the bar borders
plt.hist(data, bins=40, edgecolor='black')
plt.show()
Here np.random.randn(10000), mentioned in subsection 2.1.3, generates 10000 normally distributed values with mean 0 and standard deviation 1. The result is shown below: the horizontal axis represents the generated values, and the vertical axis represents the number of values falling in each bin, i.e., the count. To display a density histogram instead, in which the bar areas sum to 1, set the density parameter to True.
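What density=True does can be verified numerically, since plt.hist() returns the bar heights and the bin edges. The sketch below is illustrative only; the names values, heights, edges, and area are not from the original text.

```python
import numpy as np
import matplotlib.pyplot as plt

values = np.random.randn(10000)

# With density=True the bar heights are rescaled so that the total bar area is 1
heights, edges, _ = plt.hist(values, bins=40, density=True, edgecolor='black')

area = np.sum(heights * np.diff(edges))   # sum of height * width over all bars
print(round(area, 6))                     # 1.0
plt.show()
```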
Supplementary knowledge: quick drawing tips in pandas library
The above are the classic Matplotlib plotting techniques. For tables in pandas there is a more convenient way to write the code, although under the hood pandas still calls Matplotlib. The demo code is as follows:
# This way of writing is only suitable for DataFrames in pandas, not directly for arrays in Numpy
import pandas as pd
df = pd.DataFrame(data)  # Convert the array data from the histogram example into a DataFrame
df.hist(bins=40, edgecolor='black')
Writing df.hist() quickly plots the same histogram as before. Because df has only one column here, df.hist() is enough; if df has several columns, select the column to plot first by writing df['column name'].hist(). This technique is used in Section 14.3.1 when plotting a histogram of the number of movie reviews.
Besides df.hist(), you can also use the generic plotting code of the pandas library:
df.plot(kind='hist')
Here setting the kind parameter to hist draws a histogram. With this generic plotting code, pandas can also draw other kinds of graphs quickly just by changing the kind parameter. To demonstrate, first create a two-dimensional DataFrame df using the knowledge from section 2.2.1:
import pandas as pd
df = pd.DataFrame([[8000, 6000], [7000, 5000], [6500, 4000]], columns=['Per capita income', 'Expenditures per capita'], index=['Beijing', 'Shanghai', 'Guangzhou'])
The two-dimensional table df is shown below:
Now a line graph or bar chart can be drawn through pandas. The code is as follows:
df['Per capita income'].plot(kind='line')  # kind='line' draws a line graph, which is also the default when kind is not set
df['Per capita income'].plot(kind='bar')  # kind='bar' draws a bar chart
Because df has more than one column, df['column name'] is used first to select the column to plot. The result is shown below: both the line graph and the bar chart are drawn quickly. If plot() is called without the kind parameter, i.e., written as df['Per capita income'].plot(), a line graph is drawn by default.
In addition, if you set the kind parameter to pie, you can draw a pie chart, and if you set it to box, you can draw a box plot, the effect is as follows:
The other values of the kind parameter for pandas quick plotting are summarized in the table below; interested readers can try them to see the effect.
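As a sketch of the pie and box variants, the block below draws both side by side on the income table. Placing them on one canvas through the ax parameter is a choice made for this example; ax_pie and ax_box are illustrative names. Other values that kind accepts include 'barh', 'kde', 'area', and 'scatter'.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(
    [[8000, 6000], [7000, 5000], [6500, 4000]],
    columns=['Per capita income', 'Expenditures per capita'],
    index=['Beijing', 'Shanghai', 'Guangzhou'],
)

fig, (ax_pie, ax_box) = plt.subplots(1, 2, figsize=(10, 4))
df['Per capita income'].plot(kind='pie', ax=ax_pie)   # one wedge per city
df.plot(kind='box', ax=ax_box)                        # one box per column
plt.show()
```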
If Chinese characters appear garbled in the plots above, add the following three lines at the very top of the code. They are the standard fix for garbled Chinese text and are explained again later.
import matplotlib.pyplot as plt  # The following lines fix garbled Chinese characters
plt.rcParams['font.sans-serif'] = ['SimHei']  # Display Chinese labels correctly
plt.rcParams['axes.unicode_minus'] = False  # Prevent the minus sign '-' from being displayed as a square
2.3.2 Common Tips for Data Visualization
This section explains some tips commonly used in data visualization: adding text descriptions, adding a legend, setting dual axes, setting the image size, rotating the x-axis labels, displaying Chinese, and drawing multiple plots.
(1) Add a text description
Use plt.title(name) to add a title to the plot, and plt.xlabel() and plt.ylabel() to add labels to the x-axis and y-axis.
import matplotlib.pyplot as plt
x = [1, 2, 3]
y = [2, 4, 6]
plt.plot(x, y)
plt.title('TITLE') # Add Title
plt.xlabel('X') # Add X-axis labels
plt.ylabel('Y') # Add Y-axis labels
plt.show() # Show pictures
The results of the run are shown below:
(2) Adding a legend
Use plt.legend() to add a legend. Before adding it, set the label parameter on each plotted line. The code is as follows:
import numpy as np
import matplotlib.pyplot as plt
# First line: set label to 'y = x + 1'
x1 = np.array([1, 2, 3])
y1 = x1 + 1
plt.plot(x1, y1, label='y = x + 1')
# Second line: set label to 'y = x*2'
y2 = x1*2
plt.plot(x1, y2, color='red', linestyle='--', label='y = x*2')
plt.legend(loc='upper left') # Legend position is set to the upper left corner
plt.show()
As shown in the figure below, the two straight lines are drawn and a legend is added in the upper left corner. To move the legend, change the loc parameter (short for location): 'upper right' places it in the upper right corner, and 'lower right' in the lower right corner.
(3) Setting the dual axes
The example above draws two lines in one graph, but if the value ranges of the two lines differ greatly, the plot does not display well. In that case a second y-axis is needed. After drawing the first line, the following line of code sets up dual axes:
plt.twinx()
Note that with dual axes the legend has to be added once after each plot rather than once at the end. Taking y = x and y = x^2 as an example, the following code demonstrates how to set dual axes:
import numpy as np
import matplotlib.pyplot as plt
# First line: set label to 'y = x'
x1 = np.array([10, 20, 30])
y1 = x1
plt.plot(x1, y1, color='red', linestyle='--', label='y = x')
plt.legend(loc='upper left') # The legend of the figure is set in the upper left corner
plt.twinx() # Setting up dual axes
# Second line: set label to 'y = x^2'
y2 = x1*x1
plt.plot(x1, y2, label='y = x^2')
plt.legend(loc='upper right')  # The legend of the second line is placed in the upper right corner
plt.show()
The result is shown in the following figure, you can see that the left and right y-axis values are very different. If you don't set dual axes, it will result in the line y = x being compressed flat, affecting the image display.
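An alternative way to write the same dual-axis plot is Matplotlib's object-oriented interface, where each y-axis is an explicit Axes object and the handles of both lines can be gathered into a single legend. This is a sketch of that variant rather than the approach used above; ax_left, ax_right, line1, and line2 are illustrative names.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([10, 20, 30])

fig, ax_left = plt.subplots()
line1, = ax_left.plot(x, x, color='red', linestyle='--', label='y = x')

ax_right = ax_left.twinx()   # second y-axis that shares the same x-axis
line2, = ax_right.plot(x, x * x, label='y = x^2')

# Both line handles go into one legend, drawn on the top-most axes
legend = ax_right.legend(handles=[line1, line2], loc='upper left')
plt.show()
```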
(4) Setting the image size
If you are not satisfied with the default image size, you can use the following code to set the image size:
plt.rcParams['figure.figsize'] = (8, 6)
The first element is the width and the second is the height, both in inches. At the default resolution of 100 dots per inch, 8 and 6 correspond to 800 and 600 pixels.
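The relationship between figure size and pixel size can be checked with a short calculation. In this sketch the resolution is fixed at dpi=100 explicitly, as an assumption for the example, so that the result does not depend on local configuration.

```python
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = (8, 6)   # width and height in inches

fig = plt.figure(dpi=100)                 # dpi fixed explicitly for this example
width_px, height_px = fig.get_size_inches() * fig.dpi
print(width_px, height_px)                # 800.0 600.0
```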
(5) Setting the X-axis angle
Sometimes the x-axis has many tick labels that crowd together. Rotating the labels helps; in the code below 45 means 45 degrees, and you can adjust the angle as needed.
import matplotlib.pyplot as plt
plt.xticks(rotation=45)
(6) Chinese display issues
By default Matplotlib cannot display Chinese characters; the following code solves the problem. Because changing the font can also stop the minus sign from displaying correctly, axes.unicode_minus is set to False in the configuration as well.
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif'] = ['SimHei']  # Display Chinese labels correctly
plt.rcParams['axes.unicode_minus'] = False  # Prevent the minus sign '-' from being displayed as a square
SimHei is the English name of the Hei (bold sans-serif) typeface. To use another font, refer to the following table of English font names:
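Whether a given font can actually be used depends on what is installed on the machine. As a hedged sketch, the fonts Matplotlib has found can be listed through its font manager; the variable name available is illustrative.

```python
from matplotlib import font_manager

# Names of all fonts that Matplotlib has discovered on this machine
available = {font.name for font in font_manager.fontManager.ttflist}

print('SimHei' in available)   # False means SimHei is not installed here
print(sorted(available)[:10])  # a few of the names that can be used instead
```

If SimHei is missing, any other installed Chinese font from this list can be put into plt.rcParams['font.sans-serif'] in its place.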
(7) Multi-plotting
As shown in the figure below, sometimes several plots need to be drawn on one canvas. Matplotlib has the concepts of the current figure and the current axes, which correspond to the current canvas and the current subplot; several subplots (axes) can be drawn on one canvas (figure). Multiple plots are usually drawn with the subplot() function or the subplots() function.
First, the subplot() function. As shown below, it usually takes three parameters: the number of rows, the number of columns, and the index of the subplot to draw on. For example, subplot(221) divides the canvas into 2 rows and 2 columns (4 subplots in total) and selects the first subplot.
The demo code is as follows:
import matplotlib.pyplot as plt
# Plot the first subplot: a line graph
ax1= plt.subplot(221)
plt.plot([1, 2, 3], [2, 4, 6])  # plt can be replaced by ax1 here
# Draw the second subplot: the bar chart
ax2= plt.subplot(222)
plt.bar([1, 2, 3], [2, 4, 6])
# Draw the third subplot: the scatterplot
ax3= plt.subplot(223)
plt.scatter([1, 3, 5], [2, 4, 6])
# Plotting the fourth subplot: histograms
ax4= plt.subplot(224)
plt.hist([2, 2, 2, 3, 4])
This is also a good opportunity to review the basic plotting methods from section 2.3.1. The result is shown below:
In order to enhance your understanding of canvas (figure) and subgraphs (axes), let's do a simple demonstration with the following code:
plt.rcParams['figure.figsize'] = (8, 4)  # Set the canvas size
plt.figure(1) # First canvas
ax1= plt.subplot(121) # The first subplot of the first canvas
plt.plot([1, 2, 3], [2, 4, 6])  # plt can be replaced by ax1 here
ax2= plt.subplot(122) # The second subfigure of the first canvas
plt.plot([2, 4, 6], [4, 8, 10])
plt.figure(2) # Second canvas
plt.plot([1, 2, 3], [4, 5, 6])
The first line of code sets the size of each canvas to 800 x 400 pixels (at the default 100 dpi). plt.figure(1) creates the first canvas, on which the subplot() calls that follow draw two subplots. plt.figure(2) creates the second canvas, which has only one subplot. The result is shown below: two canvases means two separate figures, the first containing two subplots and the second only one.
With the subplot() function, subplot() has to be called every time a new subplot is drawn on; for example, the fourth subplot has to be written as ax4 = plt.subplot(224). Is there a way to create several subplots at once? The subplots() function does exactly that. The code is as follows:
fig, axes = plt.subplots(nrows=2, ncols=2)
ax1, ax2, ax3, ax4 = axes.flatten()
In the first line, subplots() takes two main parameters: nrows, the number of rows, and ncols, the number of columns. Here it creates 2 rows and 2 columns of subplots (4 in total) and returns two things: fig (the canvas) and axes (the collection of subplots, stored as an array). The call can also be abbreviated as fig, axes = plt.subplots(2, 2).
In the second line, flatten() flattens the array of subplots so that each can be unpacked. Since there are known to be 4 subplots, "ax1, ax2, ax3, ax4" receives them, after which each can be drawn on. The demo code below uses the abbreviated form of subplots() and sets figsize to (10, 8), i.e., 1000 x 800 pixels at the default 100 dpi.
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
ax1, ax2, ax3, ax4 = axes.flatten()
ax1.plot([1, 2, 3], [2, 4, 6]) # Draw the first subgraph
ax2.bar([1, 2, 3], [2, 4, 6]) # Draw the second subgraph
ax3.scatter([1, 3, 5], [2, 4, 6]) # Draw the third subgraph
ax4.hist([2, 2, 2, 3, 4]) # Drawing the fourth subgraph
The final drawing result is shown below:
In addition, to set the title, x-axis label, or y-axis label of a subplot created by subplot() or subplots(), use the set_title(), set_xlabel(), and set_ylabel() functions. The demo code is as follows:
plt.rcParams['font.sans-serif'] = ['SimHei']  # Display Chinese labels correctly
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
ax1, ax2, ax3, ax4 = axes.flatten()
ax1.plot([1, 2, 3], [2, 4, 6]) # Draw the first subgraph
ax1.set_title('Subfigure 1')
ax1.set_xlabel('Date')
ax1.set_ylabel('Score')
ax2.bar([1, 2, 3], [2, 4, 6]) # Draw the second subgraph
ax3.scatter([1, 3, 5], [2, 4, 6]) # Draw the third subgraph
ax4.hist([2, 2, 2, 3, 4]) # Drawing the fourth subgraph
The first line sets the font to SimHei so that Chinese text is not garbled. The result is shown in the figure below: the first subplot now has a title, an x-axis label, and a y-axis label.
To summarize: with subplot(), each subplot position has to be specified by a separate subplot() call, whereas subplots() creates several subplots at once, after which you only call methods on the returned ax objects.
2.4 Comprehensive Case Study - Stock Data Reading and K-Line Charting
Library updates
This section has been updated substantially; readers of the book can scan the QR code at the front of the book to view the errata document.
Pages 66 to 72 are updated, mainly because the original mpl_finance library has been deprecated. The new library is called mplfinance, and the corresponding code has been updated accordingly.
Specifically, you can refer to the following document: /docs/pwwhYchCjkQt9YwK/ "stock K line charting", you can copy the link and use the Graphite Document App or applet to open it
Also, if tushare displays a prompt message, you can keep using it as long as it still works. If it stops working, refer to the official documentation and switch to the Pro version.
Tutorial on how to use Tushare Pro:
/docs/PGWdY6Q8xDKpvDXG/ "Tushare Pro Tutorial (I)", you can copy the link and open it with Graphite Document App or applet.
In this section we review and apply the pandas and Matplotlib libraries through a comprehensive case: reading stock data and drawing K-line charts. We proceed from easy to hard: first a preliminary attempt at reading and visualizing stock data, then a comprehensive exercise in drawing stock K-line charts.
2.4.1 Preliminary Attempts - Stock Data Reading and Visualization
Here we use a simple case of reading and visualizing stock data to consolidate what we learned earlier.
1. Stock Database: Installation and Use of Tushare Library
First, use pip to install the Tushare library (official address: /), which provides stock price data. On Windows, for example: press Win + R to bring up the Run box, type cmd and press Enter, then type pip install tushare in the window that opens and press Enter. If you are using the Jupyter Notebook editor described in section 1.2.3, type the following into a code cell and run it (note that the exclamation mark must be the English, half-width character):
!pip install tushare
For more on Tushare and other stock-related libraries, see Section 8.2, "Stock Prediction Model"; here the main aim is to review the pandas and Matplotlib libraries.
The basic data of a stock can be obtained with just a few lines of code:
import tushare as ts
df = ts.get_k_data('000002', start='2009-01-01', end='2019-01-01')
df.head()
Line 1 introduces the tushare library and shortens it to ts;
The second line obtains the daily data for the stock of the listed company Vanke from 2009-01-01 to 2019-01-01. The first parameter, '000002', is the stock code (it corresponds to Vanke A); the second parameter, start, is the start date; and the third parameter, end, is the end date. The result is a DataFrame, the two-dimensional table structure covered in Section 2.2.1, and it is assigned to the variable df. The line can also be abbreviated as:
df = ts.get_k_data('000002', '2009-01-01', '2019-01-01')
The third line displays the first five rows of the table with df.head(), covered in section 2.2.3. Outside the Jupyter Notebook editor you need to print it with the print() function, written as print(df.head()). The result is shown below:
Here date is the trading date, open the opening price, high the highest price, close the closing price, low the lowest price, volume the trading volume, and code the stock code.
At this point, if you want to get the stock data into the Excel file, you can use the relevant knowledge points in section 2.2.2, the code is as follows:
df.to_excel('Share price data.xlsx', index=False)
Setting the index parameter to False omits the original row index. The file name is a relative path (see the supplementary knowledge in section 2.2.2), so the Excel file "Share price data.xlsx" is created in the folder where the code is located.
2. Plotting stock price charts
With the stock price data in hand, it can now be visualized. First, use the set_index() function from the supplementary knowledge in section 2.2.1 to set the date as the row index, which makes it convenient to plot directly with pandas afterwards. The code is as follows:
df.set_index('date', inplace=True)
The two-dimensional table at this point is shown below:
Plotting uses the pandas plotting techniques from the supplementary knowledge in section 2.3.1. Because plot() in pandas draws a line graph by default, plot() can be called directly without the kind parameter. In finance the closing price is usually taken as the price of the day when drawing a stock price chart, so the close column is selected here.
df['close'].plot()
The plot() function in pandas uses the row index as the horizontal axis by default. The date was set as the row index earlier, so the result is as shown below:
To add a title to the chart, pass a title parameter to plot() in pandas. Note that if the title contains Chinese, the font-setting code from section 2.3.2 is needed to prevent garbled characters. The code is as follows:
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif'] = ['SimHei']  # Display Chinese labels correctly
df['close'].plot(title='Vanke Stock Chart')
Get the result as shown below:
Supplementary knowledge: Notes on drawing directly with the Matplotlib library
The code above uses the plot() function of pandas, which builds on Matplotlib. Readers who would rather draw the stock price trend with Matplotlib directly can use the following code:
# Getting stock price data through the Tushare library
import tushare as ts
df= ts.get_k_data('000002', start='2009-01-01', end='2019-01-01')
# Details to note: Adjusting the date format so that the horizontal coordinates are clearly displayed
from datetime import datetime
df['date'] = df['date'].apply(lambda x:datetime.strptime(x,'%Y-%m-%d'))
# Draw a line graph
import matplotlib.pyplot as plt
plt.plot(df['date'], df['close'])
plt.show()
The result of the plotting is shown below:
One of the things to watch out for is the following two lines of code:
from datetime import datetime
df['date'] = df['date'].apply(lambda x:datetime.strptime(x,'%Y-%m-%d'))
Because df['date'] holds strings, plotting it directly makes the x-axis tick labels extremely dense and hard to read. Here datetime.strptime() converts each value to a datetime object, so that Matplotlib spaces the date labels automatically. As you can see, using Matplotlib directly is slightly more work than plotting with pandas, because the date format has to be converted first.
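The same conversion can also be written with pandas' own pd.to_datetime(), which converts the whole column in one call. The sketch below is an alternative to the strptime() approach above; to stay runnable without network access it uses a tiny hypothetical table with made-up prices in place of the Tushare data.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for the Tushare table; the prices are made up for illustration
df = pd.DataFrame({
    'date': ['2018-12-26', '2018-12-27', '2018-12-28'],
    'close': [23.5, 23.8, 23.6],
})

# Convert the whole string column to datetime values in one call
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
print(df['date'].dtype)

plt.plot(df['date'], df['close'])
plt.show()
```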
2.4.2 Comprehensive Practice - Stock K-Line Charting
In the last section we learned how to pull basic stock price data from the Tushare library and how to plot a price trend chart. In this section we move on to something more exciting: plotting stock K-line charts.
1. Basic knowledge of stock K-line charts
An actual stock K-line chart is shown below (this is the daily K-line chart of the stock "Guizhou Maotai"):
Readers who have not dealt with stocks before may be confused by the various bars and lines, but these figures are all drawn from very basic data. This subsection introduces the basics of stock K-line charts.
The bar-shaped figures are the "K-line chart" proper. Each bar is drawn from four prices of a stock: the opening price (the price at which trading begins at 9:30 a.m. that day), the closing price (the price at which trading ends at 3:00 p.m. that day), the high (the highest price reached during the day), and the low (the lowest price reached during the day). The four prices are often abbreviated as "high, open, low, close".
As shown in the chart below, these four prices are enough to draw the red and green bars. Because each bar resembles a candle, the K-line chart is also called a candlestick chart. There are two kinds of bar: if the day's closing price is higher than the opening price, i.e. the price rose, the bar is called a positive line and is usually drawn in red; conversely, if the closing price is lower than the opening price, i.e. the price fell, it is called a negative line and is usually drawn in green. As a side note, the convention in the United States is the opposite: red for down and green for up.
Next, the principle behind the moving average (SMA) lines, i.e. the line charts drawn over the bars. Common moving averages are the 5-day average (usually called MA5), the 10-day average (MA10), the 20-day average (MA20), and so on. Each is the mean of the stock's closing prices: for example, the 5-day average is the sum of the closing prices of the five most recent consecutive trading days divided by 5. The formula is as follows, where Close1 is the current day's closing price, Close2 is the previous day's closing price, and so on.
MA5 = (Close1 + Close2 + Close3 + Close4 + Close5)/5
Connecting each day's 5-day average value into a smooth curve gives the 5-day average line. The 10-day and 20-day average lines follow the same principle. These are the lines seen in the chart at the very beginning of this subsection.
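The MA5 formula can be checked by hand with a few lines of plain Python. This is only a sketch: the closing prices are made-up illustrative values and moving_average() is a hypothetical helper, not part of any library.

```python
# Hypothetical closing prices, oldest first (illustrative values only)
closes = [26.44, 26.30, 27.03, 27.12, 27.81, 27.50, 27.90]

def moving_average(prices, window):
    """Return one average for each day that has `window` days of history."""
    averages = []
    for end in range(window, len(prices) + 1):
        last = prices[end - window:end]  # the `window` most recent closes
        averages.append(sum(last) / window)
    return averages

ma5 = moving_average(closes, 5)
print([round(v, 3) for v in ma5])
```

Seven closing prices yield only three MA5 values, because the first four days do not yet have five days of history behind them.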
Once you understand the basics of stock K charting, let's move on to K charting.
2. Drawing stock K-line charts
Drawing a stock K chart is not complicated, but you have to do some preparation, so let's go through it step by step.
(1) Installation of the library for drawing K-line charts: mpl_finance library
First, install the library used to draw K-line charts: the mpl_finance library. Its installation is slightly unusual, and installing with pip is recommended. On Windows, for example, press Win + R to open the Run box, type cmd and press Enter, then type the following command in the window that opens and press Enter to install:
pip install https:///matplotlib/mpl_finance/archive/
If you are installing from the Jupyter Notebook described in section 1.2.3, put an exclamation mark "!" in front of pip and then run the code cell:
!pip install https:///matplotlib/mpl_finance/archive/
After installing the mpl_finance library you can call its candlestick_ochl() function to draw a K-line (candlestick) chart. Before the actual drawing, some data preparation is needed.
(2) Importing the plotting-related libraries
First, import the libraries used for plotting. The code is as follows:
import tushare as ts
import matplotlib.pyplot as plt
import mpl_finance as mpf
import seaborn as sns
sns.set()
The first line imports the Tushare library covered in section 2.4.1; the second imports the Matplotlib library covered in section 2.3.1; the third imports the mpl_finance library that was just installed; the fourth imports seaborn, a chart-beautification library whose default style is activated by the sns.set() call on the last line. If you installed Python through Anaconda as described in section 1.2.1, seaborn is already included and does not need to be installed separately. The code above can be run as is.
(3) Obtaining basic stock data through the Tushare library
Use the Tushare library to obtain the price data of "Vanke A" (stock code "000002") from 2019-06-01 to 2019-09-30. The code is as follows:
df = ts.get_k_data('000002','2019-06-01', '2019-09-30')
Get the result as shown below:
(4) Date formatting and table conversion
Before drawing the K-line chart, a little data preparation is required. This part is slightly more complex, but in practice the accompanying source code can be used directly; the principles below are for interested readers.
The candlestick_ochl() function only accepts dates in a specific numeric format, and only accepts its data as an array, so the original text-type dates have to be converted. The code is as follows:
# Import the two libraries involved in date conversion
from matplotlib.dates import date2num
import datetime

# Convert the date strings returned by Tushare into the numeric format that candlestick_ochl() can read
def date_to_num(dates):
    num_time = []
    for date in dates:
        date_time = datetime.datetime.strptime(date, '%Y-%m-%d')
        num_date = date2num(date_time)
        num_time.append(num_date)
    return num_time

# Convert the DataFrame to a two-dimensional array and use date_to_num() to convert the dates
df_arr = df.values  # Convert the DataFrame to a NumPy array
df_arr[:, 0] = date_to_num(df_arr[:, 0])  # Convert the date column from text to numeric format
The import lines at the top load the two libraries needed for date conversion.
The date_to_num() function is defined so that the conversion can be reused later. It first turns each text-format date into a datetime object with the strptime() function, and then turns that into a numeric date with the date2num() function. The line that reads df.values uses the values attribute to convert the two-dimensional DataFrame into a two-dimensional Numpy array, because the candlestick_ochl() function only accepts a two-dimensional array.
The last line calls date_to_num() to convert the original dates. In df_arr[:,0], the ":" selects all rows and the "0" selects the first column, so the expression refers to the first column of the two-dimensional array, i.e. the "date" column.
Interested readers can print df_arr[0:5] to show the first 5 rows after conversion. The converted two-dimensional data has the following form:
array([[737213.0, 26.81, 26.44, 27.02, 26.28, 317567.0, '000002'],
[737214.0, 26.47, 26.3, 26.54, 26.25, 203260.0, '000002'],
[737215.0, 26.64, 27.03, 27.28, 26.63, 576164.0, '000002'],
[737216.0, 27.01, 27.12, 27.29, 26.92, 333792.0, '000002'],
[737220.0, 27.29, 27.81, 28.05, 27.17, 527547.0, '000002']],
dtype=object)
You can see that the two-dimensional DataFrame has become a two-dimensional array, and that the date column has been converted from text-type dates to numeric dates, which is the form the candlestick_ochl() function needs for plotting the K-line chart.
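The same preparation can be sketched without a loop, since date2num() also accepts a whole array of dates. The example below is a minimal sketch on hypothetical sample rows (not live Tushare data); it also leaves out the text "code" column so that the result is a float array rather than dtype=object. Note that the exact numbers date2num() returns depend on the Matplotlib version's date epoch, so they may differ from the values printed above.

```python
import numpy as np
import pandas as pd
from matplotlib.dates import date2num, num2date

# Hypothetical sample rows in the same column order as the array above
df = pd.DataFrame({
    'date': ['2019-06-03', '2019-06-04', '2019-06-05'],
    'open': [26.81, 26.47, 26.64],
    'close': [26.44, 26.30, 27.03],
    'high': [27.02, 26.54, 27.28],
    'low': [26.28, 26.25, 26.63],
    'volume': [317567.0, 203260.0, 576164.0],
})

# Convert the whole date column in one call instead of looping
date_nums = date2num(pd.to_datetime(df['date']).to_numpy())

# Stack dates and prices into one float array
price_cols = ['open', 'close', 'high', 'low', 'volume']
arr = np.column_stack([date_nums, df[price_cols].to_numpy(dtype=float)])

print(arr.dtype, arr.shape)
# num2date() reverses the conversion, which is a handy sanity check
print(num2date(arr[0, 0]).strftime('%Y-%m-%d'))
```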
(5) Plotting K-Line Charts
Once the data format has been converted, drawing the K-line chart is relatively simple: the candlestick_ochl() function does most of the work. The code is as follows:
fig, ax = plt.subplots(figsize=(15,6))
mpf.candlestick_ochl(ax, df_arr, width=0.6, colorup='r', colordown='g', alpha=1.0)
plt.grid(True) # Draw the grid
ax.xaxis_date() # Set the x-axis scale to date
The first line creates a canvas and a subplot (only one here) and uses the figsize parameter to set the figure size to 15 x 6 inches, which is 1500 x 600 pixels at 100 dpi. The second line is the core of the chart: the candlestick_ochl() function. Its main parameters are described below:
Here df_arr is the price-history array produced by the data processing in the previous step. The colorup parameter is set to "r" (short for red), so days on which the stock rises are drawn in red; the colordown parameter is set to "g" (short for green), so days on which the stock falls are drawn in green.
The third line draws grid lines with plt.grid(True). The fourth line calls the xaxis_date() function to treat the x-axis values as dates, so that the numeric dates produced by the conversion are displayed in the usual date format.
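To make the colorup / colordown logic concrete, here is a hand-rolled sketch in plain Matplotlib. It is not mpl_finance's implementation, only an illustration of the idea: a thin vertical line from low to high, plus a bar between open and close. The draw_candles() helper and the sample rows are hypothetical.

```python
import matplotlib
matplotlib.use('Agg')  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical sample rows: (x position, open, close, high, low)
quotes = [
    (0, 26.81, 26.44, 27.02, 26.28),
    (1, 26.47, 26.30, 26.54, 26.25),
    (2, 26.64, 27.03, 27.28, 26.63),
]

def draw_candles(ax, quotes, width=0.6, colorup='r', colordown='g'):
    """Draw one candle per row and return the color chosen for each."""
    colors = []
    for x, open_, close, high, low in quotes:
        color = colorup if close >= open_ else colordown
        # Thin line spanning the day's low and high
        ax.vlines(x, low, high, color=color, linewidth=1)
        # Body between the opening and closing prices
        ax.bar(x, abs(close - open_), width=width,
               bottom=min(open_, close), color=color)
        colors.append(color)
    return colors

fig, ax = plt.subplots(figsize=(6, 3))
colors = draw_candles(ax, quotes)
ax.grid(True)
print(colors)
```

The first two sample days close below their open and are drawn in green; the third closes above its open and is drawn in red.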
The final plotting result is shown below:
(6) Plotting K-charts and SMAs
With the K-line chart in place, let's add the moving average lines. Here we add the 5-day and 10-day averages. First, construct the 5-day and 10-day average data with the following code:
df['MA5'] = df['close'].rolling(5).mean()
df['MA10'] = df['close'].rolling(10).mean()
The rolling() and mean() functions compute the moving average directly. For a 20-day or 30-day average, just change the number passed to rolling() to 20 or 30. Interested readers can view the first 15 rows of df at this point with df.head(15), as shown below:
You can see that the first 4 rows of the MA5 column are empty. This is because the 5-day average is the mean of the closing prices of five consecutive trading days, and the first four days do not yet have five days of data, so no average can be calculated and the value is null. Similarly, the first 9 rows of the MA10 column are null, and values start from the 10th row.
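The empty leading rows can be reproduced on a small series. The sketch below uses hypothetical closing prices; the min_periods argument shown at the end is an optional extra that is not used in the chapter's code.

```python
import pandas as pd

# Hypothetical closing prices for 12 trading days
close = pd.Series([26.44, 26.30, 27.03, 27.12, 27.81, 27.50,
                   27.90, 28.10, 27.95, 28.30, 28.45, 28.20])

ma5 = close.rolling(5).mean()
ma10 = close.rolling(10).mean()

# Count the empty (NaN) values at the start of each series
print(ma5.isna().sum(), ma10.isna().sum())

# Optional: min_periods=1 averages whatever history exists, so no NaN appears
ma5_partial = close.rolling(5, min_periods=1).mean()
print(ma5_partial.isna().sum())
```

Whether to keep the NaN values or fill the early rows with partial averages is a design choice; leaving them as NaN, as the chapter does, means the line simply starts later on the chart.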
Once you have the 5-day SMA and 10-day SMA data, you can plot it on the graph with the following code:
plt.rcParams['font.sans-serif'] = ['SimHei']  # Use the SimHei font so that Chinese labels display properly
fig, ax = plt.subplots(figsize=(15,6))
mpf.candlestick_ochl(ax, df_arr, width=0.6, colorup='r', colordown='g', alpha=1.0)
plt.plot(df_arr[:,0], df['MA5'])  # Plot the 5-day moving average
plt.plot(df_arr[:,0], df['MA10'])  # Plot the 10-day moving average
plt.grid(True)  # Draw the grid
plt.title('Vanke A')  # Set the title
plt.xlabel('Date')  # Set the X-axis label
plt.ylabel('Price')  # Set the Y-axis label
ax.xaxis_date()  # Display the x-axis ticks as dates
The first line applies the Chinese-display technique from section 2.3.2, setting the font to SimHei so that Chinese text is not garbled. The 5-day and 10-day average lines are drawn by the following two lines of code:
plt.plot(df_arr[:,0], df['MA5'])  # Plot the 5-day moving average
plt.plot(df_arr[:,0], df['MA10'])  # Plot the 10-day moving average
In essence these lines use the Matplotlib plot() function to draw line charts. Here df_arr[:,0] is the date column processed in step (4): the ":" selects all rows and the "0" selects the first column, so it is the first column of the two-dimensional array, i.e. the "date" column.
The resulting image is shown below. You can see that the average lines have been added to the chart, along with the title and the horizontal and vertical axis labels.
(7) Drawing stock K charts, SMA charts, volume bar charts
In practice, a bar chart of daily trading volume usually appears together with the K-line chart and the average lines. Using the multiple-subplot technique from section 2.3.2, the following code draws two subplots on one canvas, containing the K-line chart, the average lines, and the volume bar chart:
fig, axes = plt.subplots(2, 1, sharex=True, figsize=(15,8))
ax1, ax2 = axes.flatten()
# Plot the first subplot: K-line chart and moving averages
mpf.candlestick_ochl(ax1, df_arr, width=0.6, colorup='r', colordown='g', alpha=1.0)
ax1.plot(df_arr[:,0], df['MA5'])  # Plot the 5-day moving average
ax1.plot(df_arr[:,0], df['MA10'])  # Plot the 10-day moving average
ax1.set_title('Vanke A')  # Set the subplot title
ax1.set_ylabel('Price')  # Set the subplot y-axis label
ax1.grid(True)
ax1.xaxis_date()
# Plot the second subplot: volume chart
ax2.bar(df_arr[:,0], df_arr[:,5])  # Plot the volume bar chart
ax2.set_xlabel('Date')  # Set the subplot x-axis label
ax2.set_ylabel('Volume')  # Set the subplot y-axis label
ax2.grid(True)
ax2.xaxis_date()
The first two lines use the multiple-subplot technique from section 2.3.2 to construct a canvas with two subplots, and set the sharex parameter to True so that the two subplots share the same x-axis. The block that follows draws the first subplot; note that titles and axis labels on a subplot are set with functions such as set_title(), set_ylabel() and set_xlabel(). The final block draws the second subplot, the volume chart: df_arr[:,0] is the first column of the two-dimensional array, i.e. the date column, and df_arr[:,5] is the sixth column, i.e. the volume data, which the bar() function described in section 2.3.1 plots as a bar chart.
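The effect of sharex=True can be seen in a small sketch that needs neither Tushare nor mpl_finance. The data below is hypothetical and a plain line stands in for the K-line bars.

```python
import matplotlib
matplotlib.use('Agg')  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical data: day index, closing price, volume
days = [0, 1, 2, 3, 4]
close = [26.44, 26.30, 27.03, 27.12, 27.81]
volume = [317567, 203260, 576164, 333792, 527547]

fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(8, 5))
ax1.plot(days, close)
ax1.set_ylabel('Price')
ax2.bar(days, volume)
ax2.set_xlabel('Date')
ax2.set_ylabel('Volume')

# Changing the x-range on one subplot changes it on the other as well
ax1.set_xlim(1, 3)
print(ax2.get_xlim())
```

Because the two subplots share one x-axis, zooming or panning the price chart keeps the volume bars aligned with it.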
The final drawing result is shown below:
We can compare it with the actual image on the Sina Finance website, as shown below, and find that the images related to the K-line chart drawn through Python are basically the same as the images on the website.
At this point the three major data analysis tools have all been introduced. There is much more to explore in these three libraries, but space is limited, so we will not go further here. This chapter covers a lot of ground; readers can treat it as a reference chapter and come back to the relevant points when needed.
/wuwei_201/article/details/105815728: new tutorial 1.
/Wilburzzz/article/details/107792381: new tutorial 2, for comparison.
pip install mplfinance
The K-line chart code using the newer mplfinance library is as follows:
import numpy as np
import pandas as pd
import tushare as ts
import mplfinance as mpf
import matplotlib.pyplot as plt
import matplotlib as mpl
from datetime import datetime
# pd.set_option() controls how DataFrames are displayed in the PyCharm output
pd.set_option('expand_frame_repr', False)  # True allows a wide table to wrap across lines; False prevents wrapping
pd.set_option('display.max_columns', None)  # Show all columns
# pd.set_option('display.max_rows', None)  # Show all rows
pd.set_option('colheader_justify', 'center')  # Center the column headers
pro = ts.pro_api('YOUR_TUSHARE_TOKEN')  # Replace with your own token from https:///user/token
mpl.rcParams['font.sans-serif'] = ['SimHei']  # Specify the default font
mpl.rcParams['axes.unicode_minus'] = False  # Prevent the minus sign '-' from being displayed as a square in saved images
df = pro.daily(ts_code='', start_date='20200101', end_date='20200801')
# df.sort_values(by='trade_date', ascending=False)
data = df.loc[:, ['trade_date', 'open', 'close', 'high', 'low', 'vol']]  # ":" takes all rows; the list selects the date, open, close, high, low and volume columns
data = data.rename(columns={'trade_date': 'Date', 'open': 'Open', 'close': 'Close', 'high': 'High', 'low': 'Low', 'vol': 'Volume'})  # Rename the columns to the names expected by the plotting function
data.set_index('Date', inplace=True)  # Use the date column as the index, replacing the original index; the index is still of object (string) type
data.index = pd.DatetimeIndex(data.index)  # Convert the string index to a DatetimeIndex
data = data.sort_index(ascending=True)  # Sort in ascending date order to form a proper time series
mpf.plot(data, type='candle', mav=(5, 10, 20), volume=True, show_nontrading=False)
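The data-preparation half of the code above can be tried out without a Tushare token. The following is a minimal sketch on hypothetical rows shaped like a daily-quote result (newest first, dates as 'YYYYMMDD' strings); it only reshapes the table and stops before the mpf.plot() call.

```python
import pandas as pd

# Hypothetical rows: newest first, dates stored as 'YYYYMMDD' strings
df = pd.DataFrame({
    'trade_date': ['20200103', '20200102'],
    'open': [32.5, 32.8],
    'close': [32.1, 32.6],
    'high': [32.8, 33.6],
    'low': [32.0, 32.5],
    'vol': [800000.0, 1000000.0],
})

# Keep the needed columns and give them capitalized names
data = df[['trade_date', 'open', 'close', 'high', 'low', 'vol']].rename(columns={
    'trade_date': 'Date', 'open': 'Open', 'close': 'Close',
    'high': 'High', 'low': 'Low', 'vol': 'Volume',
})

# Parse the date strings, use them as the index, and sort oldest first
data['Date'] = pd.to_datetime(data['Date'], format='%Y%m%d')
data = data.set_index('Date').sort_index()

print(type(data.index).__name__)
print(data.index.is_monotonic_increasing)
```

The result has a DatetimeIndex in ascending order and columns named Open, Close, High, Low and Volume, which is the shape the mpf.plot() call above is given.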
2.5 Curriculum-related resources
How to contact the author: via WeChat.
Add the following WeChat ID: huaxz001.
The author's website:
Yutao Wang's related courses are available through:
Jingdong link: [/Search?keyword=Wang Yutao]. Search for "Wang Yutao"; the books are also available for purchase on Taobao and Dangdang. To join the learning exchange group, add the WeChat ID huaxz001 (please state your reason).
Various courses can be found by searching for Yutao Wang on Netease Cloud and 51CTO.