Regex To Match All HTML Tags Except <p> and </p>



I need to match and remove all tags using a regular expression in Perl. I have the following:


But this still matches the closing </p> tag. Any hint on how to exclude the closing tag as well?

Note that this is being performed on XHTML.



I came up with this:


<               # Match open angle bracket
(?!             # Negative lookahead (fail if this matches; consumes nothing)
    \/?         # 0 or 1 /
    p           # p
    (?=         # Positive lookahead (must match; consumes nothing)
        >       # > - no attributes
        |       # or
        \s      # whitespace
        .*      # anything up to
        >       # close angle bracket - with attributes
    )           # close positive lookahead
)               # close negative lookahead
                # if we have got this far then we are not at
                # a p tag or closing p tag,
                # with or without attributes
\/?             # optional close tag symbol (/)
.*?             # and anything up to
>               # the first close angle bracket

This leaves p tags (with or without attributes) and closing p tags alone, but still matches pre and other tags whose names merely start with p, with or without attributes.
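
To show how the pattern might be used, here is a minimal sketch that drops it into a substitution with the /x and /g flags. The sub name strip_tags_except_p and the sample input are made up for illustration. It also assumes each tag sits on a single line, since . does not match a newline without /s.

```perl
use strict;
use warnings;

# Hypothetical helper name; the pattern is the one from this answer.
sub strip_tags_except_p {
    my ($text) = @_;
    $text =~ s{
        <                    # open angle bracket
        (?!                  # negative lookahead: refuse p and /p tags
            /?p              # optional slash, then p
            (?= > | \s.*> )  # tag name ends here: bare, or with attributes
        )
        /?                   # optional slash of a closing tag
        .*?                  # anything, lazily
        >                    # the first close angle bracket
    }{}gx;
    return $text;
}

my $in = '<div><p class="x">Hello <b>world</b></p><pre>code</pre></div>';
print strip_tags_except_p($in), "\n";
# prints: <p class="x">Hello world</p>code
```

The div, b and pre tags are removed, while the p tag keeps its class attribute.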

It doesn't strip attributes from the p tags it leaves in place, but my source data doesn't contain any. I may add that later, but this will suffice for now.
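
If attribute stripping were needed, one possible approach is a second substitution that rewrites every opening p tag to a bare one. This is only a sketch and not something I have run against real data. The sub name strip_p_attributes is hypothetical, and it assumes no attribute value contains a > character.

```perl
use strict;
use warnings;

# Hypothetical helper: reduce <p ...> to <p>, and <p ... /> to <p/>.
sub strip_p_attributes {
    my ($text) = @_;
    # \s after p keeps pre, param, etc. from matching;
    # (/?) preserves the slash of a self-closing tag.
    $text =~ s{<p\s[^>]*?(/?)>}{<p$1>}g;
    return $text;
}

print strip_p_attributes('<p class="x" id="y">text</p>'), "\n";
# prints: <p>text</p>
```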