How to Split a Text into Words

This is the first post in a short series about common string manipulation problems in .NET (especially C#), with code snippets as solutions and an emphasis on real-world scenarios.

This first post is about splitting a text (a paragraph, a sentence, or any other piece of text) into the words that make it up. This is obviously a common problem and, of course, one that seems to be solved by the built-in string manipulation methods in .NET.

However, things are not always as easy as they seem! So far I’ve seen several code snippets for this task that rely on the String.Split method and pass it some common separator characters.

But this isn’t the whole story! I can outline a few concerns about this implementation:

So you see that splitting based on a constant set of separator characters is not a thorough solution.
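As one illustration of the point, here is a minimal sketch of the fixed-separator approach. The separator set and the sample input are assumptions chosen for the demonstration, not taken from any particular snippet. A no-break space (U+00A0) and an Arabic question mark (U+061F) are not in the set, so the words around them stay glued together.

```csharp
using System;

public static class NaiveSplitDemo
{
    // Hypothetical fixed separator set, chosen only for this illustration.
    private static readonly char[] FixedSeparators = { ' ', ',', '.', ';', ':', '!', '?' };

    // Splits on the fixed set only; anything outside the set is kept inside a token.
    public static string[] NaiveSplit(string text)
    {
        return text.Split(FixedSeparators, StringSplitOptions.RemoveEmptyEntries);
    }

    public static void Main()
    {
        // U+00A0 (no-break space) and U+061F (Arabic question mark) are not in the fixed set.
        string[] words = NaiveSplit("one\u00A0two, three\u061F four");
        foreach (string word in words)
            Console.WriteLine("[" + word + "]");
    }
}
```

The call yields three tokens instead of four: "one" and "two" remain a single token, and the Arabic question mark stays attached to "three".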

So what’s my solution? It is as simple as the following function, which takes a text as a string and returns its words as an array of strings.

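// Requires: using System; using System.Collections.Generic;
public static string[] SplitIntoWords(string text)
{
    // Guard against null or empty input before touching the text.
    if (string.IsNullOrEmpty(text))
        return new string[0];

    // Step 1: a constant set of common separators.
    var delimiterString = @" ,.:;~!@#$%^&*(){}\/[]<>|'?؟-_+،""=";

    // Step 2: collect every character of the input that is a Unicode
    // separator or whose numeric value is above the 2500 threshold.
    var separators = new List<char>();
    foreach (char ch in text)
    {
        if (char.IsSeparator(ch) || Convert.ToInt32(ch) > 2500)
            separators.Add(ch);
    }

    // Step 3: merge both sets of separators and split the text.
    var delimiter = (delimiterString + new string(separators.ToArray())).ToCharArray();
    return text.Split(delimiter, StringSplitOptions.RemoveEmptyEntries);
}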

Let me describe this function briefly. The first step is to define a constant set of common separators that will already be familiar to you.

The second step is to iterate through all the characters in the text and extract any separator character using the Char.IsSeparator method. Internally, .NET classifies each character into one of the groups of the UnicodeCategory enumeration. Char.IsSeparator returns true for any character classified as SpaceSeparator, LineSeparator, or ParagraphSeparator.
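To see the classification in practice, the following sketch prints the category of a few sample characters along with the results of Char.IsSeparator and Char.IsWhiteSpace. The sample characters and the Describe helper are my own choices for illustration.

```csharp
using System;
using System.Globalization;

public static class SeparatorCategoryDemo
{
    // Hypothetical helper: reports how .NET classifies a single character.
    public static string Describe(char ch)
    {
        UnicodeCategory category = char.GetUnicodeCategory(ch);
        return string.Format("U+{0:X4} {1} IsSeparator={2} IsWhiteSpace={3}",
            (int)ch, category, char.IsSeparator(ch), char.IsWhiteSpace(ch));
    }

    public static void Main()
    {
        // Space, no-break space, line separator, paragraph separator, tab, line feed, comma.
        char[] samples = { ' ', '\u00A0', '\u2028', '\u2029', '\t', '\n', ',' };
        foreach (char ch in samples)
            Console.WriteLine(Describe(ch));
    }
}
```

Note that tab and line feed are classified as Control characters, so Char.IsSeparator returns false for them even though Char.IsWhiteSpace returns true. Since they are not in the constant separator list either, words on both sides of a line break may stay joined. If that matters for your input, checking Char.IsWhiteSpace as well is one option.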

In addition, I check each character’s numeric value and treat the character as a separator if that value is larger than 2500. This number is a threshold that sits below the huge set of East Asian language characters. (The comparison uses decimal 2500, so every character above that value is treated as a separator, not only East Asian ones.)
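The threshold check can be tried on its own. This is a small sketch with sample characters of my choosing; IsAboveThreshold is a hypothetical helper that repeats the comparison used in SplitIntoWords.

```csharp
using System;

public static class ThresholdDemo
{
    // Same comparison as in SplitIntoWords: decimal 2500, which is U+09C4.
    public static bool IsAboveThreshold(char ch)
    {
        return Convert.ToInt32(ch) > 2500;
    }

    public static void Main()
    {
        // Latin 'a', Arabic seen, CJK ideograph, Hiragana 'a', right single quotation mark.
        char[] samples = { 'a', '\u0633', '\u4E2D', '\u3042', '\u2019' };
        foreach (char ch in samples)
            Console.WriteLine("U+{0:X4} ({1}) above threshold: {2}",
                (int)ch, (int)ch, IsAboveThreshold(ch));
    }
}
```

Latin and Arabic letters fall below the threshold, while the CJK ideograph and the Hiragana letter fall above it. The typographic apostrophe (U+2019, decimal 8217) is also above it, so a word such as "I’m" written with that character would be split in two.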

The third step is to merge the two lists of separators and use the result to split the text into its words.

As is, this method splits text in other languages into words but does not split East Asian text into words. Instead, it excludes those characters from its result. Many existing implementations return words and sentences from these languages as single long strings.
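The effect on mixed input can be sketched as follows. The class contains a compact copy of SplitIntoWords with a null/empty guard added, and the sample sentence (English, Persian, and Chinese) is an assumption chosen for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class MixedTextDemo
{
    // Compact copy of SplitIntoWords from this post, with a null/empty guard added.
    public static string[] SplitIntoWords(string text)
    {
        if (string.IsNullOrEmpty(text))
            return new string[0];

        // Start from the constant separators, then add separators found in the input.
        var delimiters = new List<char>(@" ,.:;~!@#$%^&*(){}\/[]<>|'?؟-_+،""=");
        foreach (char ch in text)
        {
            if (char.IsSeparator(ch) || Convert.ToInt32(ch) > 2500)
                delimiters.Add(ch);
        }
        return text.Split(delimiters.ToArray(), StringSplitOptions.RemoveEmptyEntries);
    }

    public static void Main()
    {
        // Illustrative input: two English words, two Persian words, four Chinese characters.
        string[] words = SplitIntoWords("Hello, world! سلام دنیا 你好世界 end");
        Console.WriteLine(string.Join(" | ", words));
    }
}
```

For this input the function returns five words: the two English words, the two Persian words, and "end". The Chinese characters are all above the threshold, so they act as separators and disappear from the result.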

Now I can use the code below to test my snippet:

private void btnSplit_Click(object sender, EventArgs e)
{
    var words = SplitIntoWords(this.txtText.Text);
    this.txtWords.Text = string.Join("\n", words);
}

And it gives me my desired output:

[Image: Output]

I may write a follow-up about East Asian languages with some details that may be nice to know.


11 Comments : 09.05.08

Feedbacks

#1
Splitting Text into Words using C#
09.05.2008 @ 12:41 PM

You've been kicked (a good thing) - Trackback from DotNetKicks.com

#2
Cohen
09.05.2008 @ 1:34 PM

Why not use a regex: with Regex("\w+").Matches(string) you catch all words. Or do I miss the point somewhere?

#3
Keyvan Nayyeri
09.05.2008 @ 1:38 PM

@Cohen:

Yes, the point is with eastern languages. There would be some other exceptions as well.

#4
Mikael Lundin
09.06.2008 @ 2:03 AM

But doesn't Cohen have a point though? Shouldn't we define "What is a word in this culture?" rather than "What is a delimiter for words in all cultures?"

While a word in English may contain a ’-sign, a word in Swedish would never do so. The same should be applicable in eastern languages(?)

I'm not sure if there is one method to rule them all, but I would prefer specifying one "word definition" / language if I were to write a multi-language application and had to parse words out of strings.

Pingback from Dew Drop - September 6, 2008 | Alvin Ashcraft's Morning Dew

Pingback from Weekly Link Post 58 « Rhonda Tipton’s WebLog

#7
ibrahim dursun
09.08.2008 @ 4:05 AM

here is one-liner:

return !string.IsNullOrEmpty(text) ? text.Split(text.Where(c => Char.IsSeparator(c) || Convert.ToInt32(c) > 2500).Union(@" ,.:;~!@#$%^&*(){}\/[]<>|'?؟-_+،""=").ToArray(), StringSplitOptions.RemoveEmptyEntries) : null;

#8
Keyvan Nayyeri
09.08.2008 @ 6:52 AM

@Ibrahim:

Ok, what do you want to prove? Couldn't you keep your knowledge private?

#9
ibrahim dursun
09.09.2008 @ 12:26 AM

no i cannot keep my knowledge private, i wanted to point out that this post is so worthless that made me sick.

#10
Keyvan Nayyeri
09.09.2008 @ 6:53 AM

@Ibrahim:

So stay sick and don't bother others with your silly comments.

#11
Text Searching
10.14.2008 @ 7:11 AM

Text Searching
