Correlated Content

Replace Text in a Stream: String Replace

What is the best way of doing this and how will it perform? Let’s first look at the “simple, inefficient buffering” approach from the YARP documentation. In this method we read the whole stream into a string, use string.Replace to do the work, and write the result to the output:

using var reader = new StreamReader(input, leaveOpen: true);
var original = await reader.ReadToEndAsync(cancellationToken);
var replaced = original.Replace(oldValue, newValue, comparisonType);
await using var writer = new StreamWriter(output, leaveOpen: true);
await writer.WriteAsync(replaced);
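
To try the snippet above in isolation, it can be wrapped in a small helper and fed an in-memory stream. This is a minimal sketch, not the post's actual code: the `StreamReplacer` class, the `ReplaceAsync` name and signature, and the example URLs are assumptions made for illustration. It also assumes .NET 7 or later, because of the `ReadToEndAsync(CancellationToken)` overload.

```csharp
using System;
using System.IO;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

// Usage: treat a small in-memory buffer as the "response body" and rewrite one URL.
var input = new MemoryStream(Encoding.UTF8.GetBytes("<a href=\"http://old.example/\">home</a>"));
var output = new MemoryStream();

await StreamReplacer.ReplaceAsync(
    input, output, "http://old.example", "https://new.example", StringComparison.Ordinal);

var result = Encoding.UTF8.GetString(output.ToArray());
Console.WriteLine(result); // <a href="https://new.example/">home</a>

// Hypothetical helper: name and signature are assumptions, not part of the original post.
static class StreamReplacer
{
    public static async Task ReplaceAsync(
        Stream input, Stream output,
        string oldValue, string newValue,
        StringComparison comparisonType,
        CancellationToken cancellationToken = default)
    {
        // Buffer the entire input as one string; simple, but memory grows with the input size.
        using var reader = new StreamReader(input, leaveOpen: true);
        var original = await reader.ReadToEndAsync(cancellationToken);

        var replaced = original.Replace(oldValue, newValue, comparisonType);

        // Disposing the writer flushes it; leaveOpen keeps the caller's stream usable.
        await using var writer = new StreamWriter(output, leaveOpen: true);
        await writer.WriteAsync(replaced);
    }
}
```

Because both streams are opened with `leaveOpen: true`, the caller stays responsible for disposing them, which matters when the streams belong to the proxy pipeline.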


Ordinal vs Linguistic Comparison

When using string.Replace you need to make two decisions: do I care about the casing, and do I care about the linguistic meaning of the strings? We already stipulated that we care about the casing. But what does the other choice mean?

  • Ordinal: compare the raw UTF-16 code unit values of the string
  • Linguistic: compare the meaning of the string, either using a specific culture or the rules defined in the ‘invariant’ culture

The difference between these two options is explained in the documentation, but if you’re a more practical learner like me, you can see how both options behave in the following examples.

Sorting

var chars = new [] {"a", "b", "å", "c"};

var ordinalSort = chars.Order(StringComparer.Ordinal);
Assert.That(ordinalSort, Is.EqualTo(new[]{"a", "b", "c", "å" }));

var invariantSort = chars.Order(StringComparer.InvariantCulture);
Assert.That(invariantSort, Is.EqualTo(new[] { "a", "å", "b", "c" }));

With ordinal comparison, å is placed after c because its value U+00E5 is larger than U+0063. The invariant culture sorts å right after a, the way a reader would expect.

Comparing

var separated = new string(['a', '\u030a']); // u030a = ̊  aka COMBINING RING ABOVE
var single = new string(['å']);

Assert.That(separated.Equals(single, StringComparison.Ordinal), Is.False);
Assert.That(separated.Equals(single, StringComparison.InvariantCulture), Is.True);

Characters with diacritics can be written as one character or as a base character followed by a combining diacritic. Linguistically, å (U+00E5) is the same as a (U+0061) combined with ̊ (U+030A). But the first is a single character and the second is made up of two, so an ordinal comparison does not treat them as equal.

Basically: ordinal is how computers would compare characters and linguistic is how humans would do it.
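
The same distinction carries over to string.Replace. The sketch below is an illustration with made-up sample strings, not code from the post. The ordinal results follow directly from the code unit values. The linguistic result is printed rather than asserted, because it depends on the globalization library in use (for example, invariant-globalization mode falls back to ordinal behaviour); consistent with the Equals assertion above, it would normally be expected to match the decomposed form.

```csharp
using System;

var precomposed = "\u00e5";   // å as a single code point
var decomposed = "a\u030a";   // a followed by COMBINING RING ABOVE

// The word "ångström", once with each spelling of the leading å.
var textPrecomposed = precomposed + "ngstr\u00f6m";
var textDecomposed = decomposed + "ngstr\u00f6m";

// Ordinal only matches the exact same code units.
var ordinalHit = textPrecomposed.Replace(precomposed, "aa", StringComparison.Ordinal);
var ordinalMiss = textDecomposed.Replace(precomposed, "aa", StringComparison.Ordinal);

Console.WriteLine(ordinalHit);                     // aangström
Console.WriteLine(ordinalMiss == textDecomposed);  // True: nothing was replaced

// Linguistic comparison; output intentionally not pinned here (see the note above).
var linguistic = textDecomposed.Replace(precomposed, "aa", StringComparison.InvariantCulture);
Console.WriteLine(linguistic);
```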

Benchmark

Now how do these options perform?

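| Method                           | Mean       | Error      | StdDev      | Median     | Ratio | RatioSD | Gen0      | Gen1      | Gen2      | Allocated | Alloc Ratio |
|----------------------------------|-----------:|-----------:|------------:|-----------:|------:|--------:|----------:|----------:|----------:|----------:|------------:|
| StringReplaceOrdinalIgnoreCase   | 10.363 ms  | 0.3083 ms  | 0.8944 ms   | 10.292 ms  | 1.00  | 0.00    | 1281.2500 | 1250.0000 | 453.1250  | 17.05 MB  | 1.00        |
| StringReplaceOrdinal             | 9.532 ms   | 0.1889 ms  | 0.3309 ms   | 9.561 ms   | 0.88  | 0.06    | 1218.7500 | 1156.2500 | 390.6250  | 12.19 MB  | 0.71        |
| StringReplaceInvariant           | 581.109 ms | 14.2111 ms | 40.3144 ms  | 571.019 ms | 56.28 | 6.32    | 1000.0000 | 1000.0000 | 1000.0000 | 20.17 MB  | 1.18        |
| StringReplaceInvariantIgnoreCase | 654.296 ms | 37.7361 ms | 110.0780 ms | 623.615 ms | 63.69 | 13.64   | 1000.0000 | 1000.0000 | 1000.0000 | 20.18 MB  | 1.18        |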

Unsurprisingly, the linguistic comparisons take far longer than the ordinal ones: roughly 56 to 64 times the OrdinalIgnoreCase baseline. But for all options memory usage is quite high, at 12 to 20 MB allocated per run.

Since our scenario is focussed on replacing URLs in a stream, we don’t need the extra complexity of linguistic comparison. OrdinalIgnoreCase is a good fit. We’ll take that as the baseline for further tests.
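
As a quick illustration of why OrdinalIgnoreCase suits URL rewriting, the sketch below compares it with plain Ordinal on a link whose scheme and host are written in mixed case. The HTML fragment and host names are made up for the example.

```csharp
using System;

// Hypothetical response fragment: scheme and host use mixed casing.
var html = "<a href=\"HTTP://Internal.Example/docs\">docs</a>";

const string oldValue = "http://internal.example";
const string newValue = "https://public.example";

// Ordinal is case-sensitive, so the mixed-case URL is not found.
var ordinal = html.Replace(oldValue, newValue, StringComparison.Ordinal);

// OrdinalIgnoreCase still compares code units, but ignores casing differences.
var ignoreCase = html.Replace(oldValue, newValue, StringComparison.OrdinalIgnoreCase);

Console.WriteLine(ordinal == html); // True: no replacement happened
Console.WriteLine(ignoreCase);      // <a href="https://public.example/docs">docs</a>
```

Note that only the matched part is replaced: the path `/docs` keeps its original casing.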

Other Posts in this series

  1. Introduction
  2. String Replace
  3. StreamReader and StreamWriter
  4. Sidestep: Regex