String Replace - Correlated Content

Replace Text in a Stream: String Replace

June 9, 2024

What is the best way of replacing text in a stream, and how will it perform?

Let’s first look at the “simple, inefficient buffering” approach from the YARP documentation. This method reads the whole stream into a string, uses string.Replace to do the work, and writes the result to the output:

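// Buffer the entire input as a single string.
using var reader = new StreamReader(input, leaveOpen: true);
var original = await reader.ReadToEndAsync(cancellationToken);
// Replace every occurrence in memory.
var replaced = original.Replace(oldValue, newValue, comparisonType);
// Write the result; leaveOpen keeps the caller's stream usable afterwards.
await using var writer = new StreamWriter(output, leaveOpen: true);
await writer.WriteAsync(replaced);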
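
To try the snippet above in isolation, it can be wrapped in a method and fed in-memory streams. This is a minimal sketch: the method name ReplaceBufferedAsync, its signature, and the example URLs are assumptions made for illustration and do not come from YARP or the original code.

```csharp
using System;
using System.IO;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical wrapper around the buffering approach (name and signature are illustrative).
static async Task ReplaceBufferedAsync(
    Stream input, Stream output,
    string oldValue, string newValue,
    StringComparison comparisonType,
    CancellationToken cancellationToken = default)
{
    using var reader = new StreamReader(input, leaveOpen: true);
    var original = await reader.ReadToEndAsync(cancellationToken);
    var replaced = original.Replace(oldValue, newValue, comparisonType);
    // Disposing the writer at the end of the method flushes it into the output stream.
    await using var writer = new StreamWriter(output, leaveOpen: true);
    await writer.WriteAsync(replaced);
}

// Usage with in-memory streams and a made-up payload.
using var source = new MemoryStream(
    Encoding.UTF8.GetBytes("<a href=\"http://old.example/page\">link</a>"));
using var destination = new MemoryStream();

await ReplaceBufferedAsync(
    source, destination,
    "http://old.example", "https://new.example",
    StringComparison.OrdinalIgnoreCase);

var result = Encoding.UTF8.GetString(destination.ToArray());
Console.WriteLine(result); // <a href="https://new.example/page">link</a>
```

Because both streams are opened with leaveOpen: true, the caller stays responsible for disposing them.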


Ordinal vs Linguistic Comparison

When using string.Replace you need to make two decisions: do I care about casing, and do I care about the linguistic meaning of the strings? We already stipulated that we care about the casing. But what does the other choice mean?

Ordinal: compare the raw numeric values of the characters (UTF-16 code units) in the string
Linguistic: compare the meaning of the string, either using a specific culture or the rules defined in the ‘invariant’ culture

Now the difference between these two options is explained in the documentation, but if you’re a more practical learner like me, you can see how both options behave in the following examples.

Sorting

var chars = new [] {"a", "b", "å", "c"};

var ordinalSort = chars.Order(StringComparer.Ordinal);
Assert.That(ordinalSort, Is.EqualTo(new[]{"a", "b", "c", "å" }));

var invariantSort = chars.Order(StringComparer.InvariantCulture);
Assert.That(invariantSort, Is.EqualTo(new[] { "a", "å", "b", "c" }));


With ordinal sorting, the å is placed after the c because its value U+00E5 is larger than U+0063. The invariant culture sorts å right after a, as a human reader would expect.
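
The sort order above follows directly from pairwise comparisons. The sketch below prints the two code point values and compares the same pair both ways. It assumes a runtime with culture data available (ICU or NLS); in globalization-invariant mode the culture comparison would fall back to ordinal behaviour.

```csharp
using System;

int cValue = 'c';
int aRingValue = '\u00e5'; // å, written as an escape so the source encoding does not matter

Console.WriteLine($"c = U+{cValue:X4}");     // c = U+0063
Console.WriteLine($"å = U+{aRingValue:X4}"); // å = U+00E5

// Ordinal: compares the numeric values, so å (0xE5) comes after c (0x63).
int ordinal = string.CompareOrdinal("\u00e5", "c");
Console.WriteLine(ordinal > 0); // True

// Linguistic: å is treated as a variant of a, so it is expected to sort before c
// (assumption: culture data is present).
int invariant = string.Compare("\u00e5", "c", StringComparison.InvariantCulture);
Console.WriteLine(invariant < 0);
```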

Comparing

var separated = new string(['a', '\u030a']); // u030a = ̊  aka COMBINING RING ABOVE
var single = new string(['å']);

Assert.That(separated.Equals(single, StringComparison.Ordinal), Is.False);
Assert.That(separated.Equals(single, StringComparison.InvariantCulture), Is.True);


Characters with diacritics can be written as one character or as the combination of the base character and the diacritic character. Linguistically å (U+00E5) is the same as a (U+0061) combined with ̊ (U+030A). But the first is a single character and the second is made up of two, so an ordinal comparison treats them as different.

Basically: ordinal is how computers would compare characters and linguistic is how humans would do it.
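
The same distinction carries over to string.Replace itself. The sketch below searches for the precomposed å in a string that contains the two-character form. The replacement text "X" is arbitrary, and the behaviour of the invariant call is an assumption based on the Equals result above: it should only match on a runtime with culture data (ICU or NLS) available.

```csharp
using System;

var separated = "a\u030a"; // 'a' followed by COMBINING RING ABOVE
var single = "\u00e5";     // precomposed å

// Ordinal: the code units differ, so nothing is found and nothing is replaced.
var ordinalResult = separated.Replace(single, "X", StringComparison.Ordinal);
Console.WriteLine(ordinalResult == separated); // True

// Linguistic: the two forms are considered equal, so a match is expected here
// (assumption: culture data is present; no output is claimed for other setups).
var invariantResult = separated.Replace(single, "X", StringComparison.InvariantCulture);
Console.WriteLine(invariantResult);
```

For text that mixes both forms, the choice of comparison type therefore changes what gets replaced, not just how fast it happens.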

Benchmark

Now how do these options perform?

Method	Mean	Error	StdDev	Median	Ratio	RatioSD	Gen0	Gen1	Gen2	Allocated	Alloc Ratio
StringReplaceOrdinalIgnoreCase	10.363 ms	0.3083 ms	0.8944 ms	10.292 ms	1.00	0.00	1281.2500	1250.0000	453.1250	17.05 MB	1.00
StringReplaceOrdinal	9.532 ms	0.1889 ms	0.3309 ms	9.561 ms	0.88	0.06	1218.7500	1156.2500	390.6250	12.19 MB	0.71
StringReplaceInvariant	581.109 ms	14.2111 ms	40.3144 ms	571.019 ms	56.28	6.32	1000.0000	1000.0000	1000.0000	20.17 MB	1.18
StringReplaceInvariantIgnoreCase	654.296 ms	37.7361 ms	110.0780 ms	623.615 ms	63.69	13.64	1000.0000	1000.0000	1000.0000	20.18 MB	1.18

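Unsurprisingly, the linguistic options take far longer than the ordinal ones: roughly 56 to 64 times the OrdinalIgnoreCase baseline. Memory usage is quite high for all options, though, because the whole input is buffered as a string and the result is a second string of similar size.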
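
For a quick feel of the difference without a benchmark harness, a rough single-run measurement can be sketched with Stopwatch. This is not the benchmark that produced the table above: the payload, the URLs, and the Measure helper are all hypothetical, and a single unwarmed run is far less reliable than a proper benchmarking tool.

```csharp
using System;
using System.Diagnostics;
using System.Text;

// Hypothetical payload: many repetitions of a small HTML fragment.
var builder = new StringBuilder();
for (var i = 0; i < 100_000; i++)
    builder.Append("<a href=\"http://old.example/page\">link</a>\n");
var text = builder.ToString();

// Hypothetical helper: times one Replace call and reports the bytes allocated on this thread.
static (TimeSpan Elapsed, long Allocated, string Result) Measure(string payload, StringComparison comparison)
{
    var before = GC.GetAllocatedBytesForCurrentThread();
    var watch = Stopwatch.StartNew();
    var result = payload.Replace("http://old.example", "https://new.example", comparison);
    watch.Stop();
    return (watch.Elapsed, GC.GetAllocatedBytesForCurrentThread() - before, result);
}

var comparisons = new[]
{
    StringComparison.Ordinal,
    StringComparison.OrdinalIgnoreCase,
    StringComparison.InvariantCulture,
    StringComparison.InvariantCultureIgnoreCase,
};

foreach (var comparison in comparisons)
{
    var (elapsed, allocated, _) = Measure(text, comparison);
    Console.WriteLine(
        $"{comparison}: {elapsed.TotalMilliseconds:F1} ms, {allocated / 1024.0 / 1024.0:F2} MB allocated");
}
```

The absolute numbers will differ from the table, since the payload and measurement method are different; only the relative gap between ordinal and linguistic comparison is of interest here.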

Since our scenario is focussed on replacing URLs in a stream, we don’t need the extra complexity of linguistic comparison. OrdinalIgnoreCase is a good fit. We’ll take that as the baseline for further tests.
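
A small illustration of why OrdinalIgnoreCase suits URLs: the scheme and host of a URL are case-insensitive, so the same address may appear with different casing in the content. The HTML fragment and URLs below are made up for the example.

```csharp
using System;

var html = "<a href=\"HTTP://Old.Example/page\">link</a>";

// Case-sensitive ordinal comparison misses the differently cased URL.
var caseSensitive = html.Replace(
    "http://old.example", "https://new.example", StringComparison.Ordinal);
Console.WriteLine(caseSensitive == html); // True

// OrdinalIgnoreCase finds it without any culture-specific rules.
var ignoreCase = html.Replace(
    "http://old.example", "https://new.example", StringComparison.OrdinalIgnoreCase);
Console.WriteLine(ignoreCase); // <a href="https://new.example/page">link</a>
```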
