9 lines
891 B
Plaintext
9 lines
891 B
Plaintext
When placing Unicode https://unicode.org/glossary/#grapheme_cluster[Grapheme Clusters] (characters which require to be encoded in multiple https://unicode.org/glossary/#code_point[Code Points]) inside a character class of a regular expression, this will likely lead to unintended behavior.
|
|
|
|
|
|
For instance, the grapheme cluster ``++c̈++`` requires two code points: one for ``++'c'++``, followed by one for the _umlaut_ modifier ``++'\u{0308}'++``. If placed within a character class, such as ``++[c̈]++``, the regex will consider the character class being the enumeration ``++[c\u{0308}]++`` instead. It will, therefore, match every ``++'c'++`` and every _umlaut_ that isn't expressed as a single codepoint, which is extremely unlikely to be the intended behavior.
|
|
|
|
|
|
This rule raises an issue every time Unicode Grapheme Clusters are used within a character class of a regular expression.
|
|
|