rspec/rules/S5868/java/rule.adoc

2021-04-28 16:49:39 +02:00
Placing Unicode https://unicode.org/glossary/#grapheme_cluster[Grapheme Clusters] (characters that must be encoded as multiple https://unicode.org/glossary/#code_point[Code Points]) inside a character class of a regular expression will likely lead to unintended behavior.
For instance, the grapheme cluster ``++c̈++`` requires two code points: one for ``++'c'++``, followed by one for the _umlaut_ modifier ``++'\u{0308}'++``. If placed within a character class, such as ``++[c̈]++``, the regex engine treats the character class as the enumeration ``++[c\u{0308}]++`` instead. It therefore matches every ``++'c'++`` and every _umlaut_ modifier that is encoded as its own code point rather than as part of a precomposed character, which is extremely unlikely to be the intended behavior.
This rule raises an issue every time Unicode Grapheme Clusters are used within a character class of a regular expression.
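
The decomposition described above can be checked directly. The following is a minimal sketch, not part of the rule itself: the class name `GraphemeClusterCodePoints` and the helper `describe` are hypothetical, and the combining diaeresis is written with Java's `\u0308` escape.

```java
class GraphemeClusterCodePoints {
    // Hypothetical helper: lists the code points of a string as "U+XXXX" tokens.
    public static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        String cluster = "c\u0308"; // 'c' followed by COMBINING DIAERESIS

        // One visible character, but two code points.
        System.out.println(describe(cluster)); // U+0063 U+0308

        // The character class matches each code point on its own...
        System.out.println("c".matches("[c\u0308]"));      // true
        System.out.println("\u0308".matches("[c\u0308]")); // true

        // ...but never the two-code-point cluster as a whole,
        // because a character class consumes exactly one code point.
        System.out.println(cluster.matches("[c\u0308]"));  // false
    }
}
```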
== Noncompliant Code Example
----
"cc̈d̈d".replaceAll("[c̈d̈]", "X"); // Noncompliant: returns "XXXXXX" instead of the expected "cXXd"
----
== Compliant Solution
----
"cc̈d̈d".replaceAll("c̈|d̈", "X"); // Compliant: returns "cXXd"
----
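
When the set of grapheme clusters is not fixed at compile time, the alternation shown above could be assembled programmatically. The sketch below is one possible approach under stated assumptions: the class `GraphemeAlternation` and its methods `anyOf` and `replaceClusters` are hypothetical names, each cluster is treated as a literal via `Pattern.quote`, and the list is assumed to be non-empty.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class GraphemeAlternation {
    // Hypothetical helper: builds an alternation such as "c̈|d̈" from literal clusters.
    // Pattern.quote keeps any regex metacharacters in a cluster from being interpreted.
    public static Pattern anyOf(List<String> clusters) {
        String alternation = clusters.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return Pattern.compile(alternation);
    }

    // Hypothetical helper: replaces every occurrence of any listed cluster.
    public static String replaceClusters(String input, List<String> clusters, String replacement) {
        return anyOf(clusters).matcher(input).replaceAll(replacement);
    }

    public static void main(String[] args) {
        String input = "cc\u0308d\u0308d"; // same text as "cc̈d̈d" in the examples above
        List<String> clusters = Arrays.asList("c\u0308", "d\u0308");

        // Character class: every code point is replaced individually.
        System.out.println(input.replaceAll("[c\u0308d\u0308]", "X")); // XXXXXX

        // Alternation: only complete clusters are replaced.
        System.out.println(replaceClusters(input, clusters, "X")); // cXXd
    }
}
```

Each alternative is matched as a whole sequence of code points, so a plain ``++'c'++`` or ``++'d'++`` without the modifier is left untouched.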