Replace Text in a Stream: String Replace
What is the best way of replacing text in a stream, and how will it perform?
Let’s first look at the “simple, inefficient buffering” approach from the
YARP documentation. In this method we just read the whole stream into a string,
use string.Replace to do the work, and write the result to the output:
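// Buffer the entire input stream into a single string.
using var reader = new StreamReader(input, leaveOpen: true);
var original = await reader.ReadToEndAsync(cancellationToken);
// Replace every occurrence in memory, using the chosen comparison rules.
var replaced = original.Replace(oldValue, newValue, comparisonType);
// Write the result; leaveOpen keeps the caller's streams usable afterwards.
await using var writer = new StreamWriter(output, leaveOpen: true);
await writer.WriteAsync(replaced);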
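To try the snippet in isolation, it can be wrapped in a small helper and run against in-memory streams. This is only a sketch: the ReplaceAsync name, the example URLs and the MemoryStream round trip are assumptions for illustration, and ReadToEndAsync with a CancellationToken requires .NET 7 or later.

```csharp
using System;
using System.IO;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper wrapping the buffering approach described above.
static async Task ReplaceAsync(
    Stream input, Stream output,
    string oldValue, string newValue,
    StringComparison comparisonType,
    CancellationToken cancellationToken = default)
{
    using var reader = new StreamReader(input, leaveOpen: true);
    var original = await reader.ReadToEndAsync(cancellationToken);
    var replaced = original.Replace(oldValue, newValue, comparisonType);
    // Disposing the writer at the end of the method flushes it to the output stream.
    await using var writer = new StreamWriter(output, leaveOpen: true);
    await writer.WriteAsync(replaced);
}

// Made-up payload; StreamWriter writes UTF-8 without a BOM by default.
using var input = new MemoryStream(
    Encoding.UTF8.GetBytes("<a href=\"http://old.example/path\">link</a>"));
using var output = new MemoryStream();

await ReplaceAsync(input, output,
    "http://old.example", "https://new.example",
    StringComparison.OrdinalIgnoreCase);

var result = Encoding.UTF8.GetString(output.ToArray());
Console.WriteLine(result); // <a href="https://new.example/path">link</a>
```

Note that both the original and the replaced string live in memory at the same time, which is worth keeping in mind when reading the allocation numbers in the benchmark.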
Ordinal vs Linguistic Comparison
When using string.Replace you need to make two decisions: do I care about the
casing, and do I care about the linguistic meaning of the strings? We already
stipulated that we care about the casing. But what does the other choice mean?
- Ordinal: compare the raw binary values of the string (its UTF-16 code units)
- Linguistic: compare the meaning of the string, either using a specific culture or the rules defined in the ‘invariant’ culture
The difference between these two options is explained in the documentation, but if you’re a more practical learner like me, you can see how both options behave in the following examples.
Sorting
var chars = new [] {"a", "b", "å", "c"};
var ordinalSort = chars.Order(StringComparer.Ordinal);
Assert.That(ordinalSort, Is.EqualTo(new[]{"a", "b", "c", "å" }));
var invariantSort = chars.Order(StringComparer.InvariantCulture);
Assert.That(invariantSort, Is.EqualTo(new[] { "a", "å", "b", "c" }));
With the ordinal comparison, the å is placed after the c because the value U+00E5 is larger than U+0063.
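The code point values behind that ordering can be inspected directly. A minimal sketch; the variable names are mine, and the linguistic result assumes a runtime with full globalization data (not invariant-globalization mode), so it is printed rather than asserted:

```csharp
using System;

char c = 'c';
char aRing = '\u00e5'; // å

// Print the numeric values that the ordinal comparison works with.
Console.WriteLine($"c = U+{(int)c:X4}");     // c = U+0063
Console.WriteLine($"å = U+{(int)aRing:X4}"); // å = U+00E5

// Ordinal: a positive value, because 0x00E5 is greater than 0x0063.
var ordinalOrder = string.CompareOrdinal("\u00e5", "c");
Console.WriteLine(Math.Sign(ordinalOrder)); // 1

// Linguistic: å is expected to sort next to a, so before c.
var invariantOrder = string.Compare("\u00e5", "c", StringComparison.InvariantCulture);
Console.WriteLine(Math.Sign(invariantOrder));
```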
Comparing
var separated = new string(['a', '\u030a']); // u030a = ̊ aka COMBINING RING ABOVE
var single = new string(['å']);
Assert.That(separated.Equals(single, StringComparison.Ordinal), Is.False);
Assert.That(separated.Equals(single, StringComparison.InvariantCulture), Is.True);
Characters with diacritics can be written as one character or as the
combination of the base character and the diacritic character.
Linguistically å (U+00E5) is the same as a (U+0061) combined with ̊ (U+030A).
But the first is just one character and the second is made up of two
characters. And that is not the same when using ordinal.
Basically: ordinal is how computers would compare characters and linguistic is how humans would do it.
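The same two decisions show up when calling string.Replace itself. The following is a sketch with made-up input strings and URLs. The two ordinal results are certain. The linguistic one is printed rather than asserted, since it depends on the globalization data available to the runtime, although the Equals result above suggests it should match.

```csharp
using System;

// Casing: only the IgnoreCase variant matches the differently cased URL.
var html = "<a href=\"HTTP://Old.Example/path\">";
var ordinal = html.Replace(
    "http://old.example", "https://new.example", StringComparison.Ordinal);
var ignoreCase = html.Replace(
    "http://old.example", "https://new.example", StringComparison.OrdinalIgnoreCase);

Console.WriteLine(ordinal == html); // True: nothing was replaced
Console.WriteLine(ignoreCase);      // <a href="https://new.example/path">

// Linguistic meaning: search for the single-character å (U+00E5)
// in a string that spells it as a + COMBINING RING ABOVE.
var decomposed = "a\u030a";
var ordinalRing = decomposed.Replace("\u00e5", "x", StringComparison.Ordinal);
var invariantRing = decomposed.Replace("\u00e5", "x", StringComparison.InvariantCulture);

Console.WriteLine(ordinalRing == decomposed); // True: ordinal sees different code units
Console.WriteLine(invariantRing);             // runtime-dependent, see note above
```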
Benchmark
Now how do these options perform?
Method | Mean | Error | StdDev | Median | Ratio | RatioSD | Gen0 | Gen1 | Gen2 | Allocated | Alloc Ratio |
---|---|---|---|---|---|---|---|---|---|---|---|
StringReplaceOrdinalIgnoreCase | 10.363 ms | 0.3083 ms | 0.8944 ms | 10.292 ms | 1.00 | 0.00 | 1281.2500 | 1250.0000 | 453.1250 | 17.05 MB | 1.00 |
StringReplaceOrdinal | 9.532 ms | 0.1889 ms | 0.3309 ms | 9.561 ms | 0.88 | 0.06 | 1218.7500 | 1156.2500 | 390.6250 | 12.19 MB | 0.71 |
StringReplaceInvariant | 581.109 ms | 14.2111 ms | 40.3144 ms | 571.019 ms | 56.28 | 6.32 | 1000.0000 | 1000.0000 | 1000.0000 | 20.17 MB | 1.18 |
StringReplaceInvariantIgnoreCase | 654.296 ms | 37.7361 ms | 110.0780 ms | 623.615 ms | 63.69 | 13.64 | 1000.0000 | 1000.0000 | 1000.0000 | 20.18 MB | 1.18 |
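Unsurprisingly, the more complex option takes way longer than the simple byte comparison: the invariant variants are roughly 56 to 64 times slower than the OrdinalIgnoreCase baseline. But for all options memory usage is quite high, at 12 to 20 MB allocated per operation.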
Since our scenario is focused on replacing URLs in a stream, we don’t need the
extra complexity of linguistic comparison. OrdinalIgnoreCase is a good fit.
We’ll take that as the baseline for further tests.