7.3 String Manipulation
Many of Python's built-in methods are well suited to handling strings. For more complex patterns, regular expressions can be used alongside them, and pandas combines the two approaches.
1 String Object Methods
For most string handling, a few built-in methods are enough. For example, you can use split to break up a comma-separated string:
val = 'a,b, guido'
val.split(',')
['a', 'b', ' guido']
split is often combined with strip to trim whitespace (including line breaks):
pieces = [x.strip() for x in val.split(',')]
pieces
['a', 'b', 'guido']
You can concatenate these substrings with a '::' delimiter using the + operator:
first, second, third = pieces
first + '::' + second + '::' + third
'a::b::guido'
This approach is not very Pythonic, though. A faster and more idiomatic way is to call the join method on the delimiter string:
'::'.join(pieces)
'a::b::guido'
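As a rough illustration of the difference between the two approaches, the sketch below builds the same string both ways from a slightly longer list. The list words is made up for this example; the point is only that join gives the same result in a single call.

```python
# Hypothetical list of pieces, invented for this illustration.
words = ['a', 'b', 'guido', 'wes']

# Concatenating with + builds the result step by step,
# conceptually producing a new string on each iteration.
joined_with_plus = ''
for i, word in enumerate(words):
    if i > 0:
        joined_with_plus += '::'
    joined_with_plus += word

# join builds the result in one call on the delimiter string.
joined_with_join = '::'.join(words)

print(joined_with_plus)  # a::b::guido::wes
print(joined_with_join)  # a::b::guido::wes
```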
Other methods deal with locating substrings. The in keyword is the best way to test for a substring, though index and find can also do the job:
'guido' in val
True
val.index(',')
1
val.find(':')
-1
Note the difference between index and find. If the string being searched for is not present, index raises an exception, whereas find returns -1:
val.index(':')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-11-280f8b2856ce> in <module>()
----> 1 val.index(':')
ValueError: substring not found
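In practice the two behaviours lead to two ways of handling a missing substring. The following is a minimal sketch; position_or_none is a hypothetical helper written for this example, not a built-in.

```python
val = 'a,b, guido'

def position_or_none(text, sub):
    """Return the index of sub in text, or None if it is absent (hypothetical helper)."""
    try:
        return text.index(sub)
    except ValueError:
        # index raises ValueError when the substring is missing
        return None

print(position_or_none(val, ','))  # 1
print(position_or_none(val, ':'))  # None

# find signals a missing substring with -1 instead of raising
print(val.find(':') == -1)  # True
```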
count returns the number of occurrences of a substring:
val.count(',')
2
replace substitutes occurrences of one pattern with another. It is also commonly used to delete a pattern, by passing an empty string as the replacement:
val.replace(',', '::')
'a::b:: guido'
val.replace(',', '')
'ab guido'
2 Regular Expressions
Regular expressions let us search for more complex patterns. A single expression, usually called a regex, is a string pattern written in the regular expression language. Python's built-in re module is used to apply regular expressions to strings.
There are many tutorials on regular expressions, so I won't cover the syntax in depth here.
The functions in the re module fall into three categories: pattern matching, substitution, and splitting. The three are closely related: a regex describes a pattern, which can then be put to many uses. As an example, suppose we want to split a string on whitespace (tabs, spaces, and newlines). The regex describing one or more whitespace characters is \s+:
import re
text = "foo bar\t baz \tqux"
re.split(r'\s+', text)
['foo', 'bar', 'baz', 'qux']
When you call re.split(r'\s+', text), the regular expression is first compiled, and then its split method is called on the text. You can compile the regex yourself with re.compile, which produces a regex object that can be reused many times:
regex = re.compile(r'\s+')
regex.split(text)
['foo', 'bar', 'baz', 'qux']
If instead you want a list of all the substrings that match the regex, use the findall method:
regex.findall(text)
[' ', '\t ', ' \t']
To avoid unwanted escaping of \ in a regular expression, use raw string literals such as r'C:\x' rather than the equivalent 'C:\\x'.
Creating a regex object with re.compile is highly recommended if you plan to apply the same expression to many strings; doing so saves CPU cycles.
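The two recommendations above can be sketched together as follows. The list lines and the name whitespace are made up for this illustration.

```python
import re

# A raw string keeps the backslash as-is, so these two literals are the same string.
print(r'C:\x' == 'C:\\x')  # True

# Compile once, then reuse the regex object on many strings.
whitespace = re.compile(r'\s+')
lines = ['foo  bar', 'baz\tqux', 'one\n two']  # made-up sample inputs
tokens = [whitespace.split(line) for line in lines]
print(tokens)  # [['foo', 'bar'], ['baz', 'qux'], ['one', 'two']]
```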
match and search are closely related to findall. While findall returns all matches in a string, search returns only the first match. More strictly, match matches only at the beginning of the string. As an example, suppose we want to find all the email addresses in a block of text:
text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com"""
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
# re.IGNORECASE makes the regex case-insensitive
regex = re.compile(pattern, flags=re.IGNORECASE)
Using findall on the text produces a list of the email addresses:
regex.findall(text)
['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']
search returns a match object for the first email address in text. For the regex above, the match object can only tell us the start and end positions of the pattern in the string:
m = regex.search(text)
m
<_sre.SRE_Match object; span=(5, 20), match='dave@google.com'>
text[m.start():m.end()]
'dave@google.com'
regex.match returns None, because it only matches if the pattern occurs at the very start of the string:
print(regex.match(text))
None
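To see the three methods side by side, here is a small sketch using the same email pattern. The two sample strings and their addresses are invented for this illustration.

```python
import re

regex = re.compile(r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}', flags=re.IGNORECASE)

# Made-up sample strings for illustration only.
starts_with_email = 'alice@example.com wrote to bob@example.org'
email_in_middle = 'Contact: alice@example.com or bob@example.org'

# match only succeeds when the pattern occurs at position 0.
print(regex.match(starts_with_email).group())  # alice@example.com
print(regex.match(email_in_middle))            # None

# search scans the string and returns the first match.
print(regex.search(email_in_middle).group())   # alice@example.com

# findall returns every match as a list of strings.
print(regex.findall(email_in_middle))          # ['alice@example.com', 'bob@example.org']
```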
Relatedly, sub returns a new string in which occurrences of the pattern are replaced by the string we specify:
print(regex.sub('REDACTED', text))
Dave REDACTED
Steve REDACTED
Rob REDACTED
Ryan REDACTED
Suppose you want to find the email addresses and, at the same time, split each address into three components: username, domain name, and domain suffix. To do this, put parentheses around each part of the pattern:
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)
A match object produced by this modified regex returns a tuple of the pattern components through its groups method:
m = regex.match('wesm@bright.net')
m.groups()
('wesm', 'bright', 'net')
When the pattern has groups, findall returns a list of tuples:
regex.findall(text)
[('dave', 'google', 'com'),
 ('steve', 'gmail', 'com'),
 ('rob', 'gmail', 'com'),
 ('ryan', 'yahoo', 'com')]
sub also has access to the groups in each match, through special symbols such as \1 and \2. \1 refers to the first matched group, \2 to the second, and so on:
print(regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text))
Dave Username: dave, Domain: google, Suffix: com
Steve Username: steve, Domain: gmail, Suffix: com
Rob Username: rob, Domain: gmail, Suffix: com
Ryan Username: ryan, Domain: yahoo, Suffix: com
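The \1, \2 backreferences are one way to reach the groups during substitution. As a sketch of an alternative that is not covered in the text above, re.sub also accepts a function that receives each match object. The sample string and the reorder function are made up for this example.

```python
import re

pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)

sample = 'Wes wesm@bright.net'  # sample line invented for this sketch

# Backreference form: \1, \2, \3 are filled from the groups of each match.
with_backrefs = regex.sub(r'\3/\2/\1', sample)

# Callable form: re.sub passes each match object to the function.
def reorder(m):
    user, domain, suffix = m.groups()
    return f'{suffix}/{domain}/{user}'

with_function = regex.sub(reorder, sample)

print(with_backrefs)  # Wes net/bright/wesm
print(with_function)  # Wes net/bright/wesm
```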
3 Vectorized String Functions in pandas
Cleaning up messy data often involves a lot of string processing, and a column of strings will sometimes contain missing values:
import numpy as np
import pandas as pd
data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
        'Rob': 'rob@gmail.com', 'Wes': np.nan}
data = pd.Series(data)
data
Dave     dave@google.com
Rob        rob@gmail.com
Steve    steve@gmail.com
Wes                  NaN
dtype: object
data.isnull()
Dave     False
Rob      False
Steve    False
Wes       True
dtype: bool
You can apply string methods and regular expressions (passing a lambda or another function) to each value using data.map, but this will fail on the NA (null) values. To cope with this, Series has array-oriented methods for string operations that skip NA values. These are accessed through the Series str attribute; for example, we can check whether each email address contains 'gmail' with str.contains:
data.str
< at 0x111f305c0>
data.str.contains('gmail')
Dave     False
Rob       True
Steve     True
Wes        NaN
dtype: object
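The contrast between a plain Python test and the str accessor can be sketched as below. This assumes a reasonably recent pandas; the na=False argument is an optional choice made for this example, telling contains to report False for missing values instead of NaN.

```python
import numpy as np
import pandas as pd

data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                  'Rob': 'rob@gmail.com', 'Wes': np.nan})

# A plain Python string test fails on the missing value, because NaN is a float.
try:
    data.map(lambda x: 'gmail' in x)
    map_failed = False
except TypeError:
    map_failed = True
print(map_failed)  # True

# The str accessor skips NA values; na=False reports them as False.
has_gmail = data.str.contains('gmail', na=False)
print(has_gmail.tolist())  # [False, True, True, False]
```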
Regular expressions can be used too, along with any re options such as IGNORECASE:
pattern
'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\.([A-Z]{2,4})'
data.str.findall(pattern, flags=re.IGNORECASE)
Dave     [(dave, google, com)]
Rob        [(rob, gmail, com)]
Steve    [(steve, gmail, com)]
Wes                        NaN
dtype: object
There are a couple of ways to do vectorized element retrieval: either use str.get, or index into the str attribute:
matches = data.str.match(pattern, flags=re.IGNORECASE)
matches
/Users/xu/anaconda/envs/py35/lib/python3.5/site-packages/ipykernel/__main__.py:1: FutureWarning: In future versions of pandas, match will change to always return a bool indexer.
  if __name__ == '__main__':
Dave     (dave, google, com)
Rob        (rob, gmail, com)
Steve    (steve, gmail, com)
Wes                      NaN
dtype: object
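The change announced in the FutureWarning above has since taken effect: in current pandas releases, str.match returns booleans rather than tuples of groups. One way to get the captured components in a newer version is str.extract, sketched below. This is a substitute technique, not the method used in the text, and the variable name parts is made up.

```python
import re
import numpy as np
import pandas as pd

pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                  'Rob': 'rob@gmail.com', 'Wes': np.nan})

# In recent pandas releases, str.match reports whether each string matches.
print(data.str.match(pattern, flags=re.IGNORECASE))

# str.extract returns a DataFrame with one column per capture group.
parts = data.str.extract(pattern, flags=re.IGNORECASE)
print(parts)

# Individual components are then ordinary columns (0, 1, 2).
print(parts[1].tolist()[:3])  # ['google', 'gmail', 'gmail']
```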
To access elements in the nested lists, we can pass an index to either of these functions:
matches.str.get(1)
Dave     google
Rob       gmail
Steve     gmail
Wes         NaN
dtype: object
matches.str.get(0)
Dave      dave
Rob        rob
Steve    steve
Wes        NaN
dtype: object
You can also slice strings using the same indexing syntax:
data.str[:5]
Dave     dave@
Rob      rob@g
Steve    steve
Wes        NaN
dtype: object