Dandello wrote on Aug 10th, 2014 at 10:54pm:
I'm actually pondering the best way to detect the CP1251 strings by using two or three characters together and checking against the non-ISO character list. There actually wouldn't be much of a problem at all with CP1251 except that there are overlaps in the character codes so ä gets converted to something else that's not right.
For most YaBB forums the encoding will most likely be Latin1 or CP1251, as Chinese is converted internally to HTML entities. Had the guys done that with Cyrillic early on, there wouldn't be a problem now.
Taking "ä" as an example, it is very likely that one of the surrounding characters is either "ä", "Ä" or low ASCII, which lets you rule out Cyrillic text. Using three characters and assuming there can't be three consecutive high-ASCII characters in Latin-1 text, the results should be pretty promising, with patterns like:
1. low-ASCII + ä + word-boundary
2. low-ASCII + ä + ä
3. low-ASCII + ä + low-ASCII
4. Ä or ä + ä + low-ASCII
5. Ä or ä + low-ASCII + ä
Low-ASCII here means a-z and A-Z. Word-boundary means any low-ASCII character other than a-z or A-Z, or any valid high-ASCII character that is not a letter with diacritical marks, i.e. character codes 32-64, 91-96, 123-126, 160-191, 215 or 247.
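To make the idea concrete, here is a rough Python sketch of the three-character check (YaBB itself is Perl, so treat this purely as an illustration). The names `classify`, `LATIN1_TRIPLES`, `latin1_score` and `guess_encoding` are made up for this example, and the final "more pattern hits than high-byte runs" rule is my own assumption about how to weigh the evidence, not something that has been tested on real forum data.

```python
# Sketch only: classify each byte, then look at every 3-byte window.
# All names here are hypothetical; nothing below is YaBB code.

def classify(b):
    """Return 'L' (a-z/A-Z), 'B' (word boundary), 'H' (high letter) or '?'."""
    if 65 <= b <= 90 or 97 <= b <= 122:
        return "L"
    # Word-boundary codes listed above: 32-64, 91-96, 123-126, 160-191, 215, 247
    if (32 <= b <= 64 or 91 <= b <= 96 or 123 <= b <= 126
            or 160 <= b <= 191 or b in (215, 247)):
        return "B"
    if 192 <= b <= 255:
        return "H"  # Latin-1 letter with diacritical mark (or a Cyrillic letter in CP1251)
    return "?"      # control characters and codes 127-159

# Patterns 1-5 from the list above, written as class triples.
LATIN1_TRIPLES = {"LHB", "LHH", "LHL", "HHL", "HLH"}

def latin1_score(data):
    """Count windows matching a Latin-1 pattern and windows of three high letters."""
    classes = "".join(classify(b) for b in data)
    windows = [classes[i:i + 3] for i in range(len(classes) - 2)]
    hits = sum(1 for w in windows if w in LATIN1_TRIPLES)
    runs = sum(1 for w in windows if w == "HHH")
    return hits, runs

def guess_encoding(data):
    """Assumed decision rule: more pattern hits than high-byte runs means Latin-1."""
    hits, runs = latin1_score(data)
    if hits == 0 and runs == 0:
        return "undecided"
    return "latin-1" if hits > runs else "cp1251"

print(guess_encoding("Käse".encode("latin-1")))
print(guess_encoding("Привет мир".encode("cp1251")))
```

With these two inputs the first call yields "latin-1" (one "LHL" window, no high-byte runs) and the second yields "cp1251" (several windows of three consecutive high letters, no pattern hits). Pure ASCII input comes back as "undecided", since there is nothing to decide on.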
Care should be taken not to mistake Latin-15 (ISO-8859-15, also called Latin-9, Windows code page 28605) text for invalid Latin-1 text, because the two are often used interchangeably. Latin-15 also uses character codes 166, 168, 180, 184 and 188-190 for letters.
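A small Python sketch of how those extra letter codes could be folded into the check. `LATIN9_LETTER_CODES`, `latin9_letters` and `is_high_letter` are hypothetical names for this example, and the `allow_latin9` switch is just one possible way to handle it; the only real API used is Python's built-in `iso8859_15` codec.

```python
# Sketch only: treat the ISO-8859-15 letter codes as letters, not word boundaries.
# Names are made up for illustration.

# Codes that are symbols in Latin-1 but letters in ISO-8859-15.
LATIN9_LETTER_CODES = {166, 168, 180, 184, 188, 189, 190}

def latin9_letters():
    """Map each extra code to the character it stands for in ISO-8859-15."""
    return {code: bytes([code]).decode("iso8859_15")
            for code in sorted(LATIN9_LETTER_CODES)}

def is_high_letter(code, allow_latin9=True):
    """True if the byte should count as a letter with diacritical mark."""
    if allow_latin9 and code in LATIN9_LETTER_CODES:
        return True
    # Latin-1 letters: 192-255 except the multiplication and division signs.
    return 192 <= code <= 255 and code not in (215, 247)

for code, char in latin9_letters().items():
    print(code, char)
```

Decoding those seven codes as ISO-8859-15 gives Š, š, Ž, ž, Œ, œ and Ÿ, so a detector that treats them as word boundaries would misjudge text containing those letters.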