Stripping quotes and apostrophes from HTML text

I need to find every ’ and " in a string and replace them with their HTML entities, &rsquo; and &quot;

…but ONLY if they’re not part of an HTML tag! (Replacing them there would screw up the HTML big-time :( ).

I tried this with a Java regexp, but it seems I broke the regexp engine. At least, as far as I can understand the cryptic error message, it seems to be saying “the point of what you’re trying to achieve here … is something I’m programmed not to even attempt”:

“Exception: Look-behind group does not have an obvious maximum length near index 13
(?i)(?<!<[^>]*)’(?![^>]*>)
             ^”

on the regexp: “(?i)(?<!<[^>]*)’(?![^>]*>)”

Is there another way of achieving this? The HTML will probably be poorly formed, so parsing it as XML isn’t really an option.
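
One regex route that avoids look-behind altogether is to let a single pattern match either a whole tag or one of the two characters, copy tags through unchanged, and swap only the bare characters. The sketch below is a minimal illustration rather than a real HTML tokenizer: the class name AlternationEscaper is made up, the replacements &rsquo; and &quot; are an assumption about what the poster wants, and a “<” with no closing “>” is simply treated as text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AlternationEscaper {
    // Either a complete tag, or one of the two characters to replace.
    private static final Pattern TAG_OR_QUOTE = Pattern.compile("<[^>]*>|[\u2019\"]");

    static String escape(String html) {
        Matcher m = TAG_OR_QUOTE.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String hit = m.group();
            String replacement;
            if (hit.charAt(0) == '<') {
                replacement = hit;        // a tag: copy it through untouched
            } else if (hit.equals("\"")) {
                replacement = "&quot;";   // assumed entity for the straight quote
            } else {
                replacement = "&rsquo;";  // assumed entity for the apostrophe
            }
            // quoteReplacement keeps '$' and '\' inside a tag from being
            // interpreted as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: <a title="x">It&rsquo;s &quot;ok&quot;</a>
        System.out.println(escape("<a title=\"x\">It\u2019s \"ok\"</a>"));
    }
}
```

Because the regex engine scans left to right, a quote inside an attribute value is consumed as part of its tag and never reaches the replacement branch. A “>” inside an attribute value would still end the tag early, so badly broken markup can slip through.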

Could you not just loop through the string (or StringBuffer?), looking at one char at a time? If it’s one of those chars, you replace it with the other thing. You keep a flag that you set when you enter an HTML tag and unset when you leave it, so you know not to touch those chars inside the tags.

I probably didn’t read the post properly, and it probably really isn’t this simple XD, but I had to post this just in case it is. :)
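
For reference, the flag-based loop suggested above comes out fairly short. This is a minimal sketch under some assumptions: the class and method names are hypothetical, the replacements are taken to be &rsquo; and &quot;, and every “<” is treated as opening a tag until the next “>”.

```java
final class TagAwareEscaper {
    static String escapeOutsideTags(String html) {
        StringBuilder out = new StringBuilder(html.length() + 16);
        boolean inTag = false; // true between a '<' and the next '>'
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;
                out.append(c);
            } else if (c == '>') {
                inTag = false;
                out.append(c);
            } else if (!inTag && c == '\u2019') {
                out.append("&rsquo;"); // assumed entity for the apostrophe
            } else if (!inTag && c == '"') {
                out.append("&quot;");  // assumed entity for the straight quote
            } else {
                out.append(c);         // everything else, including chars inside tags
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: <a href="x">It&rsquo;s &quot;fine&quot;</a>
        System.out.println(escapeOutsideTags("<a href=\"x\">It\u2019s \"fine\"</a>"));
    }
}
```

The weak spot is a stray “<” in the text itself: it switches the flag on, and nothing gets replaced until the next “>” turns it off again.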

Whilst that would work, technically speaking, it’s pretty horrendous to have to write such a hacked-up piece of code just for a simple search/replace op :(.

Surely there’s a better way?

What about:
“>[^<]*’”
Doesn’t this regex do the job already?

Used in a replace, that would remove EVERYTHING from the end of the tag (including the close-tag symbol!) up to and including the apostrophe itself :).
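
A quick way to see the objection is to run the pattern through String.replaceAll. The sketch below uses a hypothetical class GreedyMatchDemo and assumes &rsquo; as the replacement; the second method shows the usual workaround of capturing the prefix and putting it back.

```java
final class GreedyMatchDemo {
    // The whole match, from '>' through the apostrophe, is replaced.
    static String naive(String html) {
        return html.replaceAll(">[^<]*\u2019", "&rsquo;");
    }

    // Capturing the prefix and re-inserting it with $1 keeps the tag closer and text.
    static String withGroup(String html) {
        return html.replaceAll("(>[^<]*)\u2019", "$1&rsquo;");
    }

    public static void main(String[] args) {
        System.out.println(naive("<b>It\u2019s</b>"));     // prints: <b&rsquo;s</b>
        System.out.println(withGroup("<b>It\u2019s</b>")); // prints: <b>It&rsquo;s</b>
    }
}
```

Even with the capture group, one pass replaces only one apostrophe per run of text after a tag, because the greedy “[^<]*” runs on to the last apostrophe in that run and the match consumes everything before it.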

Actually, I thought you could use it just to find the positions of the “’”, and it should do that fine: a “>” as the tag closer, then some normal characters that are not “<”, and it should stop at a “’”. Hmm… “’” is also matched by “[^<]”, so this might be better:
“>[^<’]*’”
When you’ve found the position, you can replace the character there and start searching for the next one.
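
The find-a-position-then-replace loop described above might look like the following. It is a sketch under assumptions: the class name PositionReplacer is invented, the pattern is widened to stop at the straight quote as well as the apostrophe, and the replacements are taken to be &rsquo; and &quot;.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PositionReplacer {
    // A tag closer, then text with no '<' and no target char, then one target char.
    private static final Pattern NEXT_TARGET =
            Pattern.compile(">[^<\u2019\"]*[\u2019\"]");

    static String replaceAfterTags(String html) {
        StringBuilder sb = new StringBuilder(html);
        Matcher m = NEXT_TARGET.matcher(sb);
        int from = 0;
        while (m.find(from)) {
            int tagCloser = m.start();   // index of the '>' that anchors the match
            int pos = m.end() - 1;       // index of the char to replace
            String entity = sb.charAt(pos) == '"' ? "&quot;" : "&rsquo;";
            sb.replace(pos, pos + 1, entity);
            from = tagCloser;            // search the same text run again
            m.reset(sb);                 // the buffer changed, so reset the matcher
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints: <p>It&rsquo;s &quot;ok&quot;</p>
        System.out.println(replaceAfterTags("<p>It\u2019s \"ok\"</p>"));
    }
}
```

Restarting from the same “>” is what lets the loop pick up a second or third character in the same run of text. Since every match has to begin at a “>”, anything that appears before the first tag is left alone.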