Keyvan Nayyeri

How to Split a Text into Words

This is the first in a short series of posts about common string manipulation problems in .NET (especially C#), with code snippets as solutions and an emphasis on real-world scenarios.

This first post is about splitting a text (a paragraph, a sentence, or any other piece of text) into the words that make it up. It is obviously a common problem, and one that seems to be solved already by the built-in string manipulation methods in .NET.

However, things are not always as easy as they seem! The code snippets I have seen so far for this task rely on the String.Split method and pass some common separator characters to it.

But that isn't the whole story! I can outline a few concerns about this implementation:

  • Splitting a text into words is not the same as splitting it on a few common separator characters, even though that may be an integral part of the implementation.
  • The list of separator characters is much longer than what many of these implementations use.
  • The structure of some languages, especially East Asian languages such as Chinese, Japanese and Korean, is different from that of other languages. I'm not very familiar with these languages, but after some study I found that they separate words differently.

So you see that splitting based on a constant set of separator characters is not a thorough solution.
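To make the second concern concrete, here is a minimal sketch. The class name, the NaiveSplit helper and its short separator list are made up for illustration; they are not taken from any particular implementation. The sample text uses a no-break space (U+00A0) and an em space (U+2003), which look like spaces to a reader but are not in the fixed list.

```csharp
using System;

public static class NaiveSplitDemo
{
    // Hypothetical fixed separator set of the kind often passed to String.Split.
    private static readonly char[] CommonSeparators = { ' ', ',', '.', ';', '!', '?' };

    public static string[] NaiveSplit(string text)
    {
        return text.Split(CommonSeparators, StringSplitOptions.RemoveEmptyEntries);
    }

    public static void Main()
    {
        // Four words for a human reader, but only one of the three gaps
        // is an ordinary space (U+0020).
        string text = "one\u00A0two\u2003three four";
        string[] words = NaiveSplit(text);

        // Prints 2: "one", "two" and "three" stay glued together.
        Console.WriteLine(words.Length);
    }
}
```

Adding more characters to the fixed list helps, but the list can never be known to be complete, which is the point of the concerns above.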

So what is my solution? It is as simple as the following function, which takes a text as a string and returns its words as an array of strings.

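// Requires: using System; using System.Collections.Generic;
public static string[] SplitIntoWords(string text)
{
    // Guard first: iterating over a null string would throw. An empty array
    // (rather than null) keeps callers such as string.Join safe.
    if (string.IsNullOrEmpty(text))
        return new string[0];

    // Step 1: a constant set of common separators.
    var delimiterString = @" ,.:;~!@#$%^&*(){}\/[]<>|'?؟-_+،""=";

    // Step 2: collect every character of the text that Unicode classifies as
    // a separator, plus every character whose number is larger than 2500.
    var separators = new List<char>();
    foreach (char ch in text)
    {
        if (char.IsSeparator(ch) || Convert.ToInt32(ch) > 2500)
            separators.Add(ch);
    }

    // Step 3: merge the two lists and split the text on them.
    var delimiter = (delimiterString + new string(separators.ToArray())).ToCharArray();
    return text.Split(delimiter, StringSplitOptions.RemoveEmptyEntries);
}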

Let me describe this function briefly. The first step is to define a constant set of common separators that will already be familiar to you.

The second step is to iterate through all the characters in the text and pick out any separator character using the Char.IsSeparator method. Internally, .NET assigns every character to one of the categories in the UnicodeCategory enumeration, and Char.IsSeparator returns true for any character classified as SpaceSeparator, LineSeparator or ParagraphSeparator.
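The classification can be inspected directly. The following sketch (the class name, the Describe helper and the sample characters are only illustrative) prints the UnicodeCategory of a few characters next to the result of Char.IsSeparator.

```csharp
using System;
using System.Globalization;

public static class SeparatorDemo
{
    // Formats the code point, Unicode category and IsSeparator result of a character.
    public static string Describe(char ch)
    {
        UnicodeCategory category = char.GetUnicodeCategory(ch);
        return string.Format("U+{0:X4} {1} IsSeparator={2}",
            (int)ch, category, char.IsSeparator(ch));
    }

    public static void Main()
    {
        // Space, no-break space, line separator, paragraph separator,
        // tab, line feed, comma and a letter.
        char[] samples = { ' ', '\u00A0', '\u2028', '\u2029', '\t', '\n', ',', 'a' };
        foreach (char ch in samples)
            Console.WriteLine(Describe(ch));
    }
}
```

The first four samples are reported as separators. Tab and line feed belong to the Control category, so Char.IsSeparator returns false for them, and a text containing line breaks may need those characters added to the separator list explicitly.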

In addition, I check each character's numeric value and treat the character as a separator if that value is larger than 2500. This number is a rough threshold that sits below the huge set of East Asian characters. It is a blunt cut-off, so characters of any other script above that value are treated as separators too.

The third step is to merge the two lists of separators and use the result to split the text into its words.

As is, this method works for other languages but does not split East Asian text into words; instead, it excludes that text from its result. By contrast, many existing implementations return words and sentences from these languages as long string values.
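To see where a few characters fall relative to this threshold, a small sketch such as the following can help. The class name, the IsAboveThreshold helper and the sample characters are chosen for illustration only; the value 2500 is the one used in the function above.

```csharp
using System;

public static class ThresholdDemo
{
    // Same cut-off value as in SplitIntoWords.
    public const int Threshold = 2500;

    public static bool IsAboveThreshold(char ch)
    {
        return Convert.ToInt32(ch) > Threshold;
    }

    public static void Main()
    {
        // Latin 'a', Arabic meem, Thai ko kai, a CJK ideograph,
        // Hiragana 'a' and a Hangul syllable.
        char[] samples = { 'a', '\u0645', '\u0E01', '\u4E2D', '\u3042', '\uD55C' };
        foreach (char ch in samples)
        {
            Console.WriteLine("U+{0:X4} = {1,5} above threshold: {2}",
                (int)ch, (int)ch, IsAboveThreshold(ch));
        }
    }
}
```

The Latin and Arabic samples (97 and 1605) fall below the threshold, while the Chinese, Japanese and Korean samples are above it. The Thai sample (3585) is above it as well, so the check treats that character as a separator even though it is not East Asian.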
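The effect on mixed text can be sketched as follows. This is a compact restatement of the approach with a shortened separator list, written only for this illustration (the class name is hypothetical), not the original listing.

```csharp
using System;
using System.Linq;

public static class MixedTextDemo
{
    // Compact restatement of the approach; the fixed separator list is
    // shortened here for readability (an assumption of this sketch).
    public static string[] SplitIntoWords(string text)
    {
        if (string.IsNullOrEmpty(text))
            return new string[0];

        char[] fixedSeparators = " ,.:;!?".ToCharArray();
        char[] delimiters = text
            .Where(ch => char.IsSeparator(ch) || ch > 2500)
            .Concat(fixedSeparators)
            .Distinct()
            .ToArray();

        return text.Split(delimiters, StringSplitOptions.RemoveEmptyEntries);
    }

    public static void Main()
    {
        // "Hello", two Chinese characters, a comma, then "world".
        string[] words = SplitIntoWords("Hello \u4E16\u754C, world");
        Console.WriteLine(string.Join("|", words)); // prints Hello|world
    }
}
```

The two Chinese characters act as delimiters themselves, so they disappear from the result instead of being returned as one long string.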

Now I can use the code below to test my code snippet:

private void btnSplit_Click(object sender, EventArgs e)
{
    var words = SplitIntoWords(this.txtText.Text);
    this.txtWords.Text = string.Join("\n", words);
}

And it gives me my desired output:

[Image: Output]

I may write a follow-up about East Asian languages with some details that may be nice to know.

14 Comments

Cohen
Sep 05, 2008 1:34 PM
#

Why not use a regex? With new Regex(@"\w+").Matches(text) you catch all words. Or am I missing the point somewhere?


Keyvan Nayyeri
Sep 05, 2008 1:38 PM
#

@Cohen:

Yes, the point is Eastern languages. There are some other exceptions as well.


Mikael Lundin
Sep 06, 2008 2:03 AM
#

But doesn't Cohen have a point, though? Shouldn't we define "What is a word in this culture?" rather than "What is a delimiter for words in all cultures?"

While a word in English may contain a ’ sign, a word in Swedish would never do so. The same should be applicable to Eastern languages(?)

I'm not sure if there is one method to rule them all, but I would prefer specifying one "word definition" per language if I were to write a multi-language application and had to parse words out of strings.



ibrahim dursun
Sep 08, 2008 4:05 AM
#

Here is a one-liner:

return !string.IsNullOrEmpty(text) ? text.Split(text.Where(c => Char.IsSeparator(c) || Convert.ToInt32(c) > 2500).Union(@" ,.:;~!@#$%^&*(){}\/[]<>|'?؟-_+،""=").ToArray(), StringSplitOptions.RemoveEmptyEntries) : null;


Keyvan Nayyeri
Sep 08, 2008 6:52 AM
#

@Ibrahim:

OK, what do you want to prove? Couldn't you keep your knowledge private?


ibrahim dursun
Sep 09, 2008 12:26 AM
#

No, I cannot keep my knowledge private. I wanted to point out that this post is so worthless that it made me sick.


Keyvan Nayyeri
Sep 09, 2008 6:53 AM
#

@Ibrahim:

So stay sick and don't bother others with your silly comments.


kanasr
Mar 01, 2009 7:18 PM
#

Hello everyone,

This is a nice subject and I really need this tool, but I don't have any idea about Visual Studio. Can someone send it to me, please?


bill
Mar 31, 2009 4:03 PM
#

Does anyone have the code for the form shown above?


John Haugeland
Jul 13, 2009 7:27 PM
#

Unfortunately, Cohen is right: this mechanism produces incorrect results for more than half of Western scripts, including Polish and French, as well as basically every non-Western script. When taken to foreign scripts - not just Asian ones - the results are fundamentally broken. This misses essentially every set of foreign punctuation outside most (not all!) Western European scripts. This fails for Cyrillic, Hindi, Thai, anything from the Americas, et cetera.

There is a correct approach explained in UAX #24. An implementation of said correct approach will be in ScUtil soon (parts of it already are), at http://scutil.com/ .
