Pandas Tutorial: String Manipulation and Regular Expressions (re)

Article Directory

7.3 String Manipulation
1 String Object Methods
2 Regular Expressions
3 Vectorized String Functions in pandas

7.3 String Manipulation

Many of Python's built-in methods are well suited to handling strings. For more complex patterns, regular expressions can be used alongside them. pandas adds to the mix by letting you apply both approaches to whole arrays of data.

1 String Object Methods

For most string handling, the built-in methods are enough. For example, you can use split to break up a comma-separated string:

val = 'a,b, guido'

val.split(',')

['a', 'b', ' guido']

split is often combined with strip to trim whitespace (including line breaks):

pieces = [x.strip() for x in val.split(',')]
pieces

['a', 'b', 'guido']

You can concatenate the substrings with a '::' delimiter using the + operator:

first, second, third = pieces

first + '::' + second + '::' + third

'a::b::guido'

But this approach is not very Pythonic. A faster way is to call the join method on the delimiter string directly:

'::'.join(pieces)

'a::b::guido'
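The split, strip, and join steps above can be combined into one small function. This is a minimal sketch; the helper name normalize_delimited and its parameters are hypothetical and not part of the original text.

```python
# Hypothetical helper: split on a delimiter, trim each piece, and rejoin.
def normalize_delimited(value, sep=',', joiner='::'):
    # split() keeps surrounding whitespace, so strip() each piece
    pieces = [piece.strip() for piece in value.split(sep)]
    # join() builds the result in one pass instead of repeated + concatenation
    return joiner.join(pieces)


val = 'a,b, guido'
result = normalize_delimited(val)
print(result)  # a::b::guido
```

Calling join once avoids building a chain of intermediate strings, which is why it tends to be preferred over repeated + concatenation when there are many pieces.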

Other methods are concerned with locating substrings. The in keyword is the best way to detect a substring, though index and find can also do the job:

'guido' in val

True

val.index(',')

1

val.find(':')

-1

Note the difference between index and find. If the string you are looking for does not exist, index raises an exception, whereas find returns -1:

val.index(':')

---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

<ipython-input-11-280f8b2856ce> in <module>()
----> 1 val.index(':')


ValueError: substring not found
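Because index raises ValueError while find returns -1, the two call for different handling. The sketch below wraps index in a try/except; the helper name locate is hypothetical and only serves as an illustration.

```python
def locate(haystack, needle):
    """Return the first position of needle, or None when it is absent."""
    try:
        return haystack.index(needle)
    except ValueError:
        # index() signals a missing substring with an exception
        return None


val = 'a,b, guido'
print(locate(val, ','))  # 1
print(locate(val, ':'))  # None
print(val.find(':'))     # -1, find() never raises for a missing substring
```

Returning None rather than -1 can be a safer convention, since -1 is itself a valid index (the last character) and is easy to misuse by accident.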

count returns the number of occurrences of a particular substring:

val.count(',')

2

replace substitutes occurrences of one pattern with another. It is also commonly used to delete a pattern by passing an empty string:

val.replace(',', '::')

'a::b:: guido'

val.replace(',', '')

'ab guido'
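As a short illustration of count and replace together, the sketch below also uses the optional third argument of str.replace, which limits how many occurrences are replaced. The variable names are illustrative only.

```python
val = 'a,b, guido'

n_commas = val.count(',')               # number of commas in the string
first_only = val.replace(',', '::', 1)  # replace only the first comma
no_commas = val.replace(',', '')        # delete every comma

print(n_commas)    # 2
print(first_only)  # a::b, guido
print(no_commas)   # ab guido
```

Strings are immutable, so each call returns a new string and val itself is unchanged.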

2 Regular Expressions

Regular expressions let us search for more complex patterns. A single expression, commonly called a regex, is a string pattern written in the regular expression language. Python's built-in re module is used to apply them.

There are many teaching resources on regular expressions, so it is easy to find a few articles to learn from. I won't cover them in depth here.

The functions in the re module fall into three categories: pattern matching, substitution, and splitting. These are naturally related: a regex describes a pattern, which can then be used for many purposes. As an example, suppose we want to split a string on a variable number of whitespace characters (tabs, spaces, newlines). The regex describing one or more whitespace characters is \s+:

import re

text = "foo    bar\t baz  \tqux"

re.split(r'\s+', text)

['foo', 'bar', 'baz', 'qux']

When you call re.split with the pattern \s+, the regular expression is first compiled, and then its split method is called on the passed text. You can compile the regex yourself with re.compile, which produces a reusable regex object:

regex = re.compile(r'\s+')

regex.split(text)

['foo', 'bar', 'baz', 'qux']
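A compiled regex object is most useful when the same pattern is applied to several strings. The following is a minimal sketch; the second sample line is invented for illustration.

```python
import re

# Compile once, then reuse the same object for every input string
whitespace = re.compile(r'\s+')

lines = ["foo    bar\t baz  \tqux", "one two\nthree"]
tokens = [whitespace.split(line) for line in lines]
print(tokens)  # [['foo', 'bar', 'baz', 'qux'], ['one', 'two', 'three']]

# Leading or trailing whitespace produces empty strings at the ends
padded = whitespace.split(' a b ')
print(padded)  # ['', 'a', 'b', '']
```

If the empty strings at the ends are unwanted, stripping the input first is one simple way to avoid them.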

If you instead want a list of all substrings matching the regex, use the findall method:

regex.findall(text)

['    ', '\t ', '  \t']

To avoid unwanted escaping with \ in a regular expression, use raw string literals such as r'C:\x' instead of the equivalent 'C:\\x'.

Creating a regex object with re.compile is highly recommended if you plan to apply the same expression to many strings, since doing so saves CPU time.
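The effect of a raw string literal can be checked directly. This is a small sketch; the sample sentence is invented for illustration.

```python
import re

# A raw literal keeps the backslash as an ordinary character
raw = r'C:\x'
escaped = 'C:\\x'
print(raw == escaped)  # True
print(len(raw))        # 4

# r'\n' is two characters (backslash, n); '\n' is a single newline
print(len(r'\n'), len('\n'))  # 2 1

# Raw literals keep regex escapes such as \d readable
numbers = re.findall(r'\d+', 'order 66, aisle 7')
print(numbers)  # ['66', '7']
```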

match and search are closely related to findall. While findall returns all matches in a string, search returns only the first match. More strictly, match only matches at the beginning of the string. As an example, suppose we want to find all email addresses in a block of text:

text = """Dave dave@google.com
          Steve steve@gmail.com
          Rob rob@gmail.com
          Ryan ryan@yahoo.com """

pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'

# re.IGNORECASE makes the regex case-insensitive
regex = re.compile(pattern, flags=re.IGNORECASE)

Using findall on the text produces a list of the email addresses:

regex.findall(text)

['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']

search returns a match object for the first match in the text. The match object can tell us the start and end positions of the match within the text:

m = regex.search(text)
m

<_sre.SRE_Match object; span=(5, 20), match='dave@google.com'>

text[m.start():m.end()]

'dave@google.com'

regex.match returns None, because it only matches when the pattern occurs at the very start of the string:

print(regex.match(text))

None
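The difference between match, search, and findall is easiest to see side by side. The sketch below uses a simpler pattern (one or more digits) and an invented sample string, so the positions are easy to verify by hand.

```python
import re

digits = re.compile(r'\d+')
sample = 'abc 123 def 456'

at_start = digits.match(sample)      # None: the string does not begin with a digit
first = digits.search(sample)        # match object for the first run of digits
everything = digits.findall(sample)  # every non-overlapping match as a list

print(at_start)                     # None
print(first.group(), first.span())  # 123 (4, 7)
print(everything)                   # ['123', '456']
```

Since match and search return None when nothing is found, it is usually worth checking the result before calling group or span on it.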

Relatedly, sub returns a new string in which occurrences of the pattern are replaced by the string we specify:

print(regex.sub('REDACTED', text))

Dave REDACTED 
          Steve REDACTED 
          Rob REDACTED 
          Ryan REDACTED

Suppose you want to find email addresses and at the same time split each address into three parts: username, domain name, and domain suffix. To do this, put parentheses around each part of the pattern:

pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'

regex = re.compile(pattern, flags=re.IGNORECASE)

A match object produced by this regex returns a tuple of the pattern components through its groups method:

m = regex.match('wesm@bright.net')

m.groups()

('wesm', 'bright', 'net')

When the pattern has groups, findall returns a list of tuples:

regex.findall(text)

[('dave', 'google', 'com'),
 ('steve', 'gmail', 'com'),
 ('rob', 'gmail', 'com'),
 ('ryan', 'yahoo', 'com')]

sub can also access the groups of each match through special symbols such as \1 and \2. \1 refers to the first matched group, \2 to the second, and so on:

print(regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text))

Dave Username: dave, Domain: google, Suffix: com 
          Steve Username: steve, Domain: gmail, Suffix: com 
          Rob Username: rob, Domain: gmail, Suffix: com 
          Ryan Username: ryan, Domain: yahoo, Suffix: com
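Groups can also be given names, which the re module supports through the (?P<name>...) syntax. This is not used in the text above; the sketch below is an optional variation on the same pattern, and the group names username, domain, and suffix are chosen here for illustration.

```python
import re

# Same pattern as above, with each group named (names are illustrative)
named_pattern = (r'(?P<username>[A-Z0-9._%+-]+)'
                 r'@(?P<domain>[A-Z0-9.-]+)'
                 r'\.(?P<suffix>[A-Z]{2,4})')
named_regex = re.compile(named_pattern, flags=re.IGNORECASE)

m = named_regex.match('wesm@bright.net')
parts = m.groupdict()
print(parts)  # {'username': 'wesm', 'domain': 'bright', 'suffix': 'net'}

# In sub, \g<name> refers to a named group just as \1 refers to a numbered one
rewritten = named_regex.sub(r'\g<username> at \g<domain>', 'wesm@bright.net')
print(rewritten)  # wesm at bright
```

Named groups make the replacement template easier to read when a pattern has more than two or three groups.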

3 Vectorized String Functions in pandas

When cleaning messy data, a column of strings will sometimes contain missing values:

import numpy as np
import pandas as pd

data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
        'Rob': 'rob@gmail.com', 'Wes': np.nan}

data = pd.Series(data)
data

Dave     dave@google.com
Rob        rob@gmail.com
Steve    steve@gmail.com
Wes                  NaN
dtype: object

data.isnull()

Dave     False
Rob      False
Steve    False
Wes       True
dtype: bool

You can apply string methods and regular expressions (passing a lambda or other function) to each value using data.map, but this fails on NA (null) values. To cope with this, Series has array-oriented methods for string operations that skip NA values. They are accessed through the str attribute of the Series; for example, we can check whether each email address contains 'gmail' with str.contains:

data.str

<pandas.core.strings.StringMethods at 0x111f305c0>

data.str.contains('gmail')

Dave     False
Rob       True
Steve     True
Wes        NaN
dtype: object
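The NaN in the result means it cannot be used directly as a boolean filter. One option is the na parameter of str.contains, which sets the value to use for missing entries. This is a minimal sketch that rebuilds the same Series so it runs on its own.

```python
import numpy as np
import pandas as pd

data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                  'Rob': 'rob@gmail.com', 'Wes': np.nan})

# na=False treats missing values as "does not contain"
has_gmail = data.str.contains('gmail', na=False)

# The result is now a plain boolean mask and can be used to filter
gmail_rows = data[has_gmail]
print(gmail_rows)
```

Whether a missing value should count as False is a judgment call that depends on the analysis, so the default of keeping NaN is often the more honest choice.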

Regular expressions can be used too, along with any re options such as IGNORECASE:

pattern

'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\.([A-Z]{2,4})'

data.str.findall(pattern, flags=re.IGNORECASE)

Dave     [(dave, google, com)]
Rob        [(rob, gmail, com)]
Steve    [(steve, gmail, com)]
Wes                        NaN
dtype: object

There are a couple of ways to do vectorized element retrieval: either use str.get or index into the str attribute:

matches = data.str.match(pattern, flags=re.IGNORECASE)
matches

/Users/xu/anaconda/envs/py35/lib/python3.5/site-packages/ipykernel/__main__.py:1: FutureWarning: In future versions of pandas, match will change to always return a bool indexer.
  if __name__ == '__main__':

Dave     (dave, google, com)
Rob        (rob, gmail, com)
Steve    (steve, gmail, com)
Wes                      NaN
dtype: object

To access elements of the nested sequences, we can pass an index to either of these functions:

matches.str.get(1)

Dave     google
Rob       gmail
Steve     gmail
Wes         NaN
dtype: object

matches.str.get(0)

Dave      dave
Rob        rob
Steve    steve
Wes        NaN
dtype: object
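The FutureWarning shown earlier says that match will change to return booleans, in which case the str.get calls above no longer apply to its result. One alternative for pulling out the groups is str.extract, which returns a DataFrame with one column per group. This is a sketch under that assumption; the column names are chosen here for illustration.

```python
import re

import numpy as np
import pandas as pd

data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                  'Rob': 'rob@gmail.com', 'Wes': np.nan})
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'

# One column per capture group; rows with missing input stay NaN
parts = data.str.extract(pattern, flags=re.IGNORECASE)
parts.columns = ['username', 'domain', 'suffix']  # illustrative names
print(parts)

# In recent pandas releases, match reports only whether each value matches
is_match = data.str.match(pattern, flags=re.IGNORECASE)
print(is_match)
```

With the groups in separate columns, parts['domain'] plays the role of matches.str.get(1) above.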

You can similarly slice strings using this syntax:

data.str[:5]

Dave     dave@
Rob      rob@g
Steve    steve
Wes        NaN
dtype: object