Regular expressions: repeating a capturing group and making the inner group non-capturing
In the dark old days when I was working on a Windows laptop, I used to use a tool called RegexBuddy to help me write regular expressions.
Up until recently, I didn't really get regular expressions. Working with Django forced me to change that and I'm so thankful I took the time to delve deeper into them. These days, I pretty much use them on a daily basis and it shaves so much time off of some mundane tasks that I can't believe I ever got by without them. (They're especially helpful since TextMate has a regular expression engine built into its Find and Replace dialog.)
Today, I needed to repeat a capture group that could occur one or more times in a string and I kept getting just the last iteration. Some quick googling brought up a very informative page on Repeating a Capturing Group vs. Capturing a Repeated Group by the author of RegexBuddy.
It appears I was making a common mistake by repeating a capturing group instead of capturing a repeated group.
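A minimal sketch of the difference, assuming Python's `re` module and a toy string that is not part of the key-parsing code (the names `text`, `repeated_capture` and `captured_repeat` are only for illustration):

```python
import re

text = "a1b2c3"

# Repeating a capturing group: the quantifier is outside the group, so the
# group is overwritten on each iteration and only the last one survives.
repeated_capture = re.match(r"([a-z]\d)+", text)
print(repeated_capture.groups())

# Capturing a repeated group: an outer group wraps the whole repetition,
# so group 1 holds every iteration and group 2 holds only the last one.
captured_repeat = re.match(r"(([a-z]\d)+)", text)
print(captured_repeat.groups())
```

The first call prints a one-element tuple with the last letter-digit pair; the second also includes the full run as its first element.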
The code in question parses Google App Engine datastore keys so that I capture the whole key path, including all ancestors. A sample string:
s = "datastore_types.Key.from_path('Parent', 1L, 'Child', 30L, _app=u'myapp')"
My first attempt, the flawed one, was:
import re
r = r"datastore_types.Key.from_path\(('.*?', \d*?L, )+_app=u'.*?'\)"
rc = re.compile(r)
rc.match(s).groups()
# result: ("'Child', 30L, ",)
What I should have written, to capture the repeated group:
r = r"datastore_types.Key.from_path\((('.*?', \d*?L, )+)_app=u'.*?'\)"
rc = re.compile(r)
rc.match(s).groups()
# result: ("'Parent', 1L, 'Child', 30L, ", "'Child', 30L, ")
This returns two groups: the outer group (the repeated group; what we want) and the last iteration of the inner group (which we don't care about).
To tidy it up further, you can make the inner group non-capturing with (?:...), so that only the outer group is returned. So the final version looks like this:
r = r"datastore_types.Key.from_path\(((?:'.*?', \d*?L, )+)_app=u'.*?'\)"
rc = re.compile(r)
rc.match(s).groups()
# result: ("'Parent', 1L, 'Child', 30L, ",)
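To run the three variants side by side, here is a self-contained sketch. It assumes Python 3 (the `1L` literals only appear inside the sample string, so they are harmless there). The helper names `prefix`, `pair`, `suffix`, `patterns` and `results` are made up for this example, and the dots in `datastore_types.Key.from_path` are escaped here, which the patterns above do not do.

```python
import re

s = "datastore_types.Key.from_path('Parent', 1L, 'Child', 30L, _app=u'myapp')"

# Shared pieces of the pattern (illustrative names, not from the post).
prefix = r"datastore_types\.Key\.from_path\("
pair = r"'.*?', \d*?L, "      # one "'Kind', idL, " segment of the key path
suffix = r"_app=u'.*?'\)"

patterns = {
    "repeating a capturing group": prefix + "(" + pair + ")+" + suffix,
    "capturing a repeated group": prefix + "((" + pair + ")+)" + suffix,
    "non-capturing inner group": prefix + "((?:" + pair + ")+)" + suffix,
}

# Match each pattern against the sample string and collect its groups.
results = {name: re.match(p, s).groups() for name, p in patterns.items()}
for name, groups in results.items():
    print(name, "->", groups)
```

Escaping the dots makes them match a literal `.` rather than any character. It does not change the result for this sample string.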
I may be more comfortable with regular expressions, but there's still so much to learn! :)
Update: And, like most things, the actual solution I ended up going with is much simpler:
r = r'datastore_types.Key.from_path\((.*?), _app'
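A quick sketch of what the simpler pattern captures, assuming Python 3 and the same sample string. The second step, splitting the path into (kind, id) pairs with `re.findall`, is a hypothetical follow-up and not something the original code is stated to do. The dots are escaped here as an extra precaution.

```python
import re

s = "datastore_types.Key.from_path('Parent', 1L, 'Child', 30L, _app=u'myapp')"

# The lazy group expands only until the first ", _app" is reached.
r = r"datastore_types\.Key\.from_path\((.*?), _app"
key_path = re.match(r, s).group(1)
print(key_path)

# Hypothetical next step: break the captured path into (kind, id) pairs.
pairs = re.findall(r"'(.*?)', (\d+)L", key_path)
print(pairs)
```

Unlike the earlier versions, this capture has no trailing comma and space, because the `, ` before `_app` sits outside the group.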