How to Split a Text into Words
This is the first in a short series of blog posts about common string manipulation problems in .NET (especially C#). Each post offers code snippets as solutions, with an emphasis on real-world scenarios.
This first post is about splitting a text (a paragraph, a sentence or any other piece of text) into the words that make it up. It is a common problem and, at first glance, one that the built-in string manipulation methods in .NET already solve.
However, things are not always as easy as they seem! The code snippets I’ve seen so far for this task rely on the String.Split method and pass it some common separator characters.
But that isn’t the whole story! I have a few concerns about this approach:
- Splitting a text into its words is not the same as splitting it on some common separator characters, even though that may be an integral part of the implementation.
- The set of separator characters is much larger than the handful used in many of these implementations.
- Some languages, especially East Asian languages (e.g. Chinese, Japanese and Korean), are structured differently from other languages. I’m not very familiar with them, but after some reading I found that they do not separate words the way most other languages do.
So splitting on a constant set of separator characters is not a thorough solution.
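To make the concern concrete, here is a minimal sketch of the kind of naive implementation I mean. The class name, the separator list and the sample text are made up for illustration; they are not taken from any particular code base.

```csharp
using System;

public static class NaiveSplitDemo
{
    // A typical hard-coded separator list (hypothetical, for illustration only).
    private static readonly char[] CommonSeparators = { ' ', ',', '.', ';', ':', '!', '?' };

    public static string[] NaiveSplit(string text)
    {
        return text.Split(CommonSeparators, StringSplitOptions.RemoveEmptyEntries);
    }

    public static void Main()
    {
        // U+00A0 (no-break space) and U+060C (Arabic comma) are not in the list,
        // so the words on either side of them stay glued together.
        string text = "one\u00A0two, three\u060Cfour";
        foreach (string word in NaiveSplit(text))
            Console.WriteLine("[" + word + "]");
    }
}
```

The sample contains four words, but this sketch returns only two pieces, because neither the no-break space nor the Arabic comma is in the fixed list.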
So what’s my solution? It is as simple as the following function, which takes a text as a string and returns an array of its words.
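using System;
using System.Collections.Generic;

public static string[] SplitIntoWords(string text)
{
    // Guard first: iterating over a null string would throw.
    if (string.IsNullOrEmpty(text))
        return new string[0];

    // Step 1: a fixed set of common separator characters.
    var delimiterString = @" ,.:;~!@#$%^&*(){}\/[]<>|'?؟-_+،""=";

    // Step 2: collect every character of the text that Unicode classifies as a
    // separator, or whose character code is above the 2500 threshold.
    var separators = new List<char>();
    foreach (char ch in text)
    {
        if (char.IsSeparator(ch) || Convert.ToInt32(ch) > 2500)
            separators.Add(ch);
    }

    // Step 3: merge both sets and split the text on them.
    delimiterString += new string(separators.ToArray());
    return text.Split(delimiterString.ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
}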
Let me describe this function briefly. The first step defines a constant set of common separators that will already be familiar to you.
The second step iterates through all the characters in the text and picks out every separator character using the Char.IsSeparator method. .NET assigns each character to one of the groups of the UnicodeCategory enumeration, and Char.IsSeparator returns true for any character classified as SpaceSeparator, LineSeparator or ParagraphSeparator.
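If you want to see the classification for yourself, the following sketch prints the Unicode category of a few characters next to the result of Char.IsSeparator. The class name, the Describe helper and the sample characters are my own choices for illustration. One thing worth noticing is that tab and line feed are classified as Control, so Char.IsSeparator returns false for them.

```csharp
using System;
using System.Globalization;

public static class SeparatorCategoryDemo
{
    // Formats the code point, Unicode category and IsSeparator result of one character.
    public static string Describe(char ch)
    {
        UnicodeCategory category = CharUnicodeInfo.GetUnicodeCategory(ch);
        return string.Format("U+{0:X4} {1} IsSeparator={2}",
            (int)ch, category, char.IsSeparator(ch));
    }

    public static void Main()
    {
        // Space, no-break space, line separator, paragraph separator,
        // then tab, line feed and comma for contrast.
        char[] samples = { ' ', '\u00A0', '\u2028', '\u2029', '\t', '\n', ',' };
        foreach (char ch in samples)
            Console.WriteLine(Describe(ch));
    }
}
```

Because tab and line feed are not covered by Char.IsSeparator, a splitter that should also break on them could check Char.IsWhiteSpace as well.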
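In addition, I check each character’s numeric code and treat the character as a separator if the code is larger than 2500. I chose this number as a threshold that sits below the huge set of East Asian characters. It is a rough cutoff: 2500 is U+09C4, inside the Bengali block, so the check also discards text in other scripts with higher code points, such as Thai or Tamil.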
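Here is a quick way to see which side of the threshold a few characters land on. This is only a sketch: the class name, the IsAboveThreshold helper and the sample characters are my own picks, and the characters are written as escapes to avoid encoding problems.

```csharp
using System;

public static class ThresholdDemo
{
    // The same cutoff that SplitIntoWords uses.
    public const int Threshold = 2500;

    public static bool IsAboveThreshold(char ch)
    {
        return Convert.ToInt32(ch) > Threshold;
    }

    public static void Main()
    {
        // Latin a, Latin e with acute, Arabic sheen, Thai ko kai,
        // Japanese hiragana a, Chinese "middle", Korean "han".
        char[] samples = { 'a', '\u00E9', '\u0634', '\u0E01', '\u3042', '\u4E2D', '\uD55C' };
        foreach (char ch in samples)
        {
            int code = Convert.ToInt32(ch);
            Console.WriteLine("U+{0:X4} = {1}, above threshold: {2}",
                code, code, IsAboveThreshold(ch));
        }
    }
}
```

The Latin and Arabic samples stay below 2500, while the Thai, Japanese, Chinese and Korean samples are all above it and would therefore be treated as separators.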
The third step merges the two sets of separators and uses the result to split the text into its words.
As is, this method splits text in most languages but does not split text in East Asian languages into words. Instead, it excludes that text from its result. Many of the existing implementations return words and sentences from these languages as long, unsplit strings.
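The effect on mixed text can be sketched as follows. The SplitIntoWords here is a compact restatement of the function from this post, with one deliberate difference: it checks for empty input first and returns an empty array. The class name and the sample sentence are assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class MixedTextDemo
{
    // Compact restatement of SplitIntoWords; returns an empty array for
    // null or empty input (an assumption made for this sketch).
    public static string[] SplitIntoWords(string text)
    {
        if (string.IsNullOrEmpty(text))
            return new string[0];

        // Start from the fixed separators, then add the ones found in the text.
        var separators = new List<char>(@" ,.:;~!@#$%^&*(){}\/[]<>|'?؟-_+،""=");
        foreach (char ch in text)
        {
            if (char.IsSeparator(ch) || Convert.ToInt32(ch) > 2500)
                separators.Add(ch);
        }
        return text.Split(separators.ToArray(), StringSplitOptions.RemoveEmptyEntries);
    }

    public static void Main()
    {
        // "Hello", two Chinese characters (U+4E16, U+754C), a comma, then "world".
        string[] words = SplitIntoWords("Hello \u4E16\u754C, world");
        Console.WriteLine(string.Join("|", words)); // prints "Hello|world"
    }
}
```

The two Chinese characters are above the threshold, so they act as separators and disappear from the result instead of coming back as one long "word".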
Now I can use the code below to test my snippet:
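private void btnSplit_Click(object sender, EventArgs e)
{
    // Fall back to an empty array in case the splitter returns null.
    var words = SplitIntoWords(this.txtText.Text) ?? new string[0];

    // A multiline TextBox needs "\r\n" to break lines, so use Environment.NewLine.
    this.txtWords.Text = string.Join(Environment.NewLine, words);
}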
And it gives me my desired output:
I may write a follow-up about East Asian languages with some details that may be nice to know.
11 Comments : 09.05.08
Feedbacks
Why not use a regex? With new Regex(@"\w+").Matches(text) you catch all words. Or am I missing the point somewhere?
@Cohen:
Yes, the point is with eastern languages. There would be some other exceptions as well.
But doesn't Cohen have a point, though? Shouldn't we define "What is a word in this culture?" rather than "What is a delimiter for words in all cultures?"
While a word in English may contain a ’-sign, a word in Swedish would never do so. The same should apply to Eastern languages(?)
I'm not sure if there is one method to rule them all, but I would prefer specifying one "word definition" / language if I were to write a multi-language application and had to parse words out of strings.
Here is a one-liner:
return !string.IsNullOrEmpty(text) ? text.Split(text.Where(c => Char.IsSeparator(c) || Convert.ToInt32(c) > 2500).Union(@" ,.:;~!@#$%^&*(){}\/[]<>|'?؟-_+،""=").ToArray(), StringSplitOptions.RemoveEmptyEntries) : null;
@Ibrahim:
Ok, what do you want to prove? Couldn't you keep your knowledge private?
No, I cannot keep my knowledge private. I wanted to point out that this post is so worthless that it made me sick.
@Ibrahim:
So stay sick and don't bother others with your silly comments.