Memory Efficiency and Performance of String.Replace .NET Framework

All characters in a .NET string are "unicode chars". Do you mean they're non-ascii? That shouldn't make any odds - unless you run into composition issues, e.g. an "e + acute accent" not being replaced when you try to replace an "e acute".

You could try using a regular expression with Regex.Replace, or StringBuilder.Replace. Here's sample code doing the same thing with both:

using System;
using System.Text;
using System.Text.RegularExpressions;

class Test
{
    static void Main(string[] args)
    {
        string original = "abcdefghijkl";

        Regex regex = new Regex("a|c|e|g|i|k", RegexOptions.Compiled);

        string removedByRegex = regex.Replace(original, "");
        string removedByStringBuilder = new StringBuilder(original)
            .Replace("a", "")
            .Replace("c", "")
            .Replace("e", "")
            .Replace("g", "")
            .Replace("i", "")
            .Replace("k", "")
            .ToString();

        Console.WriteLine(removedByRegex);
        Console.WriteLine(removedByStringBuilder);
    }
}

I wouldn't like to guess which is more efficient - you'd have to benchmark with your specific application. The regex way may be able to do it all in one pass, but that pass will be relatively CPU-intensive compared with each of the many replaces in StringBuilder.


1. What is the fastest way of replacing these values, ignoring memory concerns?

The fastest way is to build a custom component that's specific to your use case. As of .NET 4.6, There's no class in the BCL designed for multiple string replacements.

If you NEED something fast out of the BCL, StringBuilder is the fastest BCL component for simple string replacement. The source code can be found here: It's pretty efficient for replacing a single string. Only use Regex if you really need the pattern-matching power of regular expressions. It's slower and a little more cumbersome, even when compiled.

2. What is the most memory efficient way of achieving the same result?

The most memory-efficient way is to perform a filtered stream copy from the source to the destination (explained below). Memory consumption will be limited to your buffer, however this will be more CPU intensive; as a rule of thumb, you're going to trade CPU performance for memory consumption.

Technical Details

String replacements are tricky. Even when performing a string replacement in a mutable memory space (such as with StringBuilder), it's expensive. If the replacement string is a different length than original string, you're going to be relocating every character following the replacement string to keep the whole string contiguous. This results in a LOT of memory writes, and even in the case of StringBuilder, causes you to rewrite most of the string in-memory on every call to Replace.

So what is the fastest way to do string replacements? Write the new string using a single-pass: Don't let your code go back and have to re-write anything. Writes are more expensive than reads. You're going to have to code this yourself for best results.

High-Memory Solution

The class I've written generates strings based on templates. I place tokens ($ReplaceMe$) in a template which marks places where I want to insert a string later. I use it in cases where XmlWriter is too onerous for XML that's largely static and repetitive, and I need to produce large XML (or JSON) data streams.

The class works by slicing the template up into parts and places each part into a numbered dictionary. Parameters are also enumerated. The order in which the parts and parameters are inserted into a new string are placed into an integer array. When a new string is generated, the parts and parameters are picked from the dictionary and used to create a new string.

It's neither fully-optimized nor is it bulletproof, but it works great for generating very large data streams from templates.

Low-Memory Solution

You'll need to read small chunks from the source string into a buffer, search the buffer using an optimized search algorithm, and then write the new string to the destination stream / string. There are a lot of potential caveats here, but it would be memory efficient and a better solution for source data that's dynamic and can't be cached, such as whole-page translations or source data that's too large to reasonably cache. I don't have a sample solution for this handy.

Sample Code

Desired Results

<DataTable source='Users'>
  <Rows>
    <Row id='25' name='Administrator' />
    <Row id='29' name='Robert' />
    <Row id='55' name='Amanda' />
  </Rows>
</DataTable>

Template

<DataTable source='$TableName$'>
  <Rows>
    <Row id='$0$' name='$1$'/>
  </Rows>
</DataTable>

Test Case

class Program
{
  static string[,] _users =
  {
    { "25", "Administrator" },
    { "29", "Robert" },
    { "55", "Amanda" },
  };

  static StringTemplate _documentTemplate = new StringTemplate(@"<DataTable source='$TableName$'><Rows>$Rows$</Rows></DataTable>");
  static StringTemplate _rowTemplate = new StringTemplate(@"<Row id='$0$' name='$1$' />");
  static void Main(string[] args)
  {
    _documentTemplate.SetParameter("TableName", "Users");
    _documentTemplate.SetParameter("Rows", GenerateRows);

    Console.WriteLine(_documentTemplate.GenerateString(4096));
    Console.ReadLine();
  }

  private static void GenerateRows(StreamWriter writer)
  {
    for (int i = 0; i <= _users.GetUpperBound(0); i++)
      _rowTemplate.GenerateString(writer, _users[i, 0], _users[i, 1]);
  }
}

StringTemplate Source

public class StringTemplate
{
  private string _template;
  private string[] _parts;
  private int[] _tokens;
  private string[] _parameters;
  private Dictionary<string, int> _parameterIndices;
  private string[] _replaceGraph;
  private Action<StreamWriter>[] _callbackGraph;
  private bool[] _graphTypeIsReplace;

  public string[] Parameters
  {
    get { return _parameters; }
  }

  public StringTemplate(string template)
  {
    _template = template;
    Prepare();
  }

  public void SetParameter(string name, string replacement)
  {
    int index = _parameterIndices[name] + _parts.Length;
    _replaceGraph[index] = replacement;
    _graphTypeIsReplace[index] = true;
  }

  public void SetParameter(string name, Action<StreamWriter> callback)
  {
    int index = _parameterIndices[name] + _parts.Length;
    _callbackGraph[index] = callback;
    _graphTypeIsReplace[index] = false;
  }

  private static Regex _parser = new Regex(@"\$(\w{1,64})\$", RegexOptions.Compiled);
  private void Prepare()
  {
    _parameterIndices = new Dictionary<string, int>(64);
    List<string> parts = new List<string>(64);
    List<object> tokens = new List<object>(64);
    int param_index = 0;
    int part_start = 0;

    foreach (Match match in _parser.Matches(_template))
    {
      if (match.Index > part_start)
      {
        //Add Part
        tokens.Add(parts.Count);
        parts.Add(_template.Substring(part_start, match.Index - part_start));
      }


      //Add Parameter
      var param = _template.Substring(match.Index + 1, match.Length - 2);
      if (!_parameterIndices.TryGetValue(param, out param_index))
        _parameterIndices[param] = param_index = _parameterIndices.Count;
      tokens.Add(param);

      part_start = match.Index + match.Length;
    }

    //Add last part, if it exists.
    if (part_start < _template.Length)
    {
      tokens.Add(parts.Count);
      parts.Add(_template.Substring(part_start, _template.Length - part_start));
    }

    //Set State
    _parts = parts.ToArray();
    _tokens = new int[tokens.Count];

    int index = 0;
    foreach (var token in tokens)
    {
      var parameter = token as string;
      if (parameter == null)
        _tokens[index++] = (int)token;
      else
        _tokens[index++] = _parameterIndices[parameter] + _parts.Length;
    }

    _parameters = _parameterIndices.Keys.ToArray();
    int graphlen = _parts.Length + _parameters.Length;
    _callbackGraph = new Action<StreamWriter>[graphlen];
    _replaceGraph = new string[graphlen];
    _graphTypeIsReplace = new bool[graphlen];

    for (int i = 0; i < _parts.Length; i++)
    {
      _graphTypeIsReplace[i] = true;
      _replaceGraph[i] = _parts[i];
    }
  }

  public void GenerateString(Stream output)
  {
    var writer = new StreamWriter(output);
    GenerateString(writer);
    writer.Flush();
  }

  public void GenerateString(StreamWriter writer)
  {
    //Resolve graph
    foreach(var token in _tokens)
    {
      if (_graphTypeIsReplace[token])
        writer.Write(_replaceGraph[token]);
      else
        _callbackGraph[token](writer);
    }
  }

  public void SetReplacements(params string[] parameters)
  {
    int index;
    for (int i = 0; i < _parameters.Length; i++)
    {
      if (!Int32.TryParse(_parameters[i], out index))
        continue;
      else
        SetParameter(index.ToString(), parameters[i]);
    }
  }

  public string GenerateString(int bufferSize = 1024)
  {
    using (var ms = new MemoryStream(bufferSize))
    {
      GenerateString(ms);
      ms.Position = 0;
      using (var reader = new StreamReader(ms))
        return reader.ReadToEnd();
    }
  }

  public string GenerateString(params string[] parameters)
  {
    SetReplacements(parameters);
    return GenerateString();
  }

  public void GenerateString(StreamWriter writer, params string[] parameters)
  {
    SetReplacements(parameters);
    GenerateString(writer);
  }
}

StringBuilder: http://msdn.microsoft.com/en-us/library/2839d5h5.aspx

The performance of the Replace operation itself should be roughly same as string.Replace and according to Microsoft no garbage should be produced.


If you want to be really fast, and I mean really fast you'll have to look beyond the StringBuilder and just write well optimized code.

One thing your computer doesn't like to do is branching, if you can write a replace method which operates on a fixed array (char *) and doesn't branch you have great performance.

What you'll be doing is that the replace operation is going to search for a sequence of characters and if it finds any such sub string it will replace it. In effect you'll copy the string and when doing so, preform the find and replace.

You'll rely on these functions for picking the index of some buffer to read/write. The goal is to preform the replace method such that when nothing has to change you write junk instead of branching.

You should be able to complete this without a single if statement and remember to use unsafe code. Otherwise you'll be paying for index checking for every element access.

unsafe
{
    fixed( char * p = myStringBuffer )
    {
        // Do fancy string manipulation here
    }
}

I've written code like this in C# for fun and seen significant performance improvements, almost 300% speed up for find and replace. While the .NET BCL (base class library) performs quite well it is riddled with branching constructs and exception handling this will slow down you code if you use the built-in stuff. Also these optimizations while perfectly sound are not preformed by the JIT-compiler and you'll have to run the code as a release build without any debugger attached to be able to observe the massive performance gain.

I could provide you with more complete code but it is a substantial amount of work. However, I can guarantee you that it will be faster than anything else suggested so far.

Tags:

C#

.Net

String