rspec/rules/S5868/java/rule.adoc

When placing Unicode https://unicode.org/glossary/#grapheme_cluster[Grapheme Clusters] (characters which require to be encoded in multiple https://unicode.org/glossary/#code_point[Code Points]) inside a character class of a regular expression, this will likely lead to unintended behavior.


For instance, the grapheme cluster ``++c̈++`` requires two code points: one for ``++'c'++``, followed by one for the _umlaut_ modifier ``++'\u{0308}'++``. If placed within a character class, such as ``++[c̈]++``, the regex will consider the character class being the enumeration ``++[c\u{0308}]++`` instead. It will, therefore, match every ``++'c'++`` and every _umlaut_ that isn't expressed as a single codepoint, which is extremely unlikely to be the intended behavior.


This rule raises an issue every time Unicode Grapheme Clusters are used within a character class of a regular expression.

== Noncompliant Code Example

----
"cc̈d̈d".replaceAll("[c̈d̈]", "X"); // Noncompliant, print "XXXXXX" instead of expected "cXXd".
----

== Compliant Solution

----
"cc̈d̈d".replaceAll("c̈|d̈", "X"); // print "cXXd"
----
Restructure one-lang rspecs 2021-04-28 16:49:39 +02:00			`When placing Unicode https://unicode.org/glossary/#grapheme_cluster[Grapheme Clusters] (characters which require to be encoded in multiple https://unicode.org/glossary/#code_point[Code Points]) inside a character class of a regular expression, this will likely lead to unintended behavior.`


			For instance, the grapheme cluster ``++c̈++`` requires two code points: one for ``++'c'++``, followed by one for the _umlaut_ modifier ``++'\u{0308}'++``. If placed within a character class, such as ``++[c̈]++``, the regex will consider the character class being the enumeration ``++[c\u{0308}]++`` instead. It will, therefore, match every ``++'c'++`` and every _umlaut_ that isn't expressed as a single codepoint, which is extremely unlikely to be the intended behavior.


			`This rule raises an issue every time Unicode Grapheme Clusters are used within a character class of a regular expression.`

			`== Noncompliant Code Example`

			`----`
			`"cc̈d̈d".replaceAll("[c̈d̈]", "X"); // Noncompliant, print "XXXXXX" instead of expected "cXXd".`
			`----`

			`== Compliant Solution`

			`----`
			`"cc̈d̈d".replaceAll("c̈\|d̈", "X"); // print "cXXd"`
			`----`