I'm Keyvan Nayyeri, a 25-year-old Ph.D. student in the Computer Science department of the University of Texas at San Antonio. I'm also a Software Architect and Developer, and I previously earned a B.Sc. degree in Applied Mathematics. This is my blog, where I publish content about various topics, specifically Programming Languages and Compilers, Software Engineering, and Programming.
This is the first part of a series of a few blog posts that I’m going to write about common string manipulation problems in .NET (especially C#), with code snippets as solutions and an emphasis on real-world scenarios.
The first post is about splitting a text (a paragraph, a sentence, or any piece of text) into the words that make it up. This is obviously a common problem and, of course, one that seems to be solved by the built-in string manipulation methods in .NET.
However, things are not always as easy as they seem! So far I’ve seen some code snippets for this that rely on the String.Split method, passing it some common separator characters.
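As a minimal sketch of that common approach (the class name NaiveSplitter and the particular separator set are assumptions chosen for illustration, not taken from any specific snippet), a fixed-separator split might look like this:

```csharp
using System;

public static class NaiveSplitter
{
    // Hypothetical fixed separator set; real snippets vary in what they include.
    private static readonly char[] Separators = { ' ', ',', '.', ';', ':', '!', '?' };

    // Splits on the fixed set only and drops empty entries.
    public static string[] Split(string text)
    {
        return text.Split(Separators, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

Calling NaiveSplitter.Split("Hello, world! (Really?)") yields "Hello", "world", "(Really" and ")", because the parentheses are not in the set. Likewise, "a\tb" comes back as a single entry because the tab character is not listed.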
But this isn’t the whole story! I can outline a few concerns about this implementation:
So you see that splitting based on a constant set of separator characters is not a thorough solution.
But what’s my solution? My solution is as simple as the following function, which takes a text as a string and returns an array of its words.
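// Requires: using System; using System.Collections.Generic;
public static string[] SplitIntoWords(string text)
{
    // Guard first: the text must be checked before it is enumerated below.
    if (string.IsNullOrEmpty(text))
        return new string[0];

    // Step 1: a constant set of common separators.
    var delimiterString = @" ,.:;~!@#$%^&*(){}\/[]<>|'?؟-_+،""=";

    // Step 2: collect the characters of the text that act as separators.
    var separators = new List<char>();
    foreach (char ch in text)
    {
        // Unicode separators, plus any character whose code is above 2500.
        if (char.IsSeparator(ch) || Convert.ToInt32(ch) > 2500)
            separators.Add(ch);
    }

    // Step 3: merge both sets and split the text.
    delimiterString += new string(separators.ToArray());
    var delimiter = delimiterString.ToCharArray();
    return text.Split(delimiter, StringSplitOptions.RemoveEmptyEntries);
}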
Let me describe this function briefly. The first step is to define a constant set of common separators that will already be familiar to you.
The second step is to iterate through all the characters in the text and extract any separator character using the Char.IsSeparator method. Internally, .NET classifies every character into one of the groups of the UnicodeCategory enumeration. Char.IsSeparator returns true for any character that is classified as SpaceSeparator, LineSeparator or ParagraphSeparator.
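To see this classification for individual characters, here is a small sketch. The SeparatorProbe class and its Describe method are hypothetical helpers for illustration. It relies only on Char.GetUnicodeCategory and Char.IsSeparator:

```csharp
using System;
using System.Globalization;

public static class SeparatorProbe
{
    // Hypothetical helper: reports the code, the Unicode category,
    // and the result of Char.IsSeparator for one character.
    public static string Describe(char ch)
    {
        UnicodeCategory category = char.GetUnicodeCategory(ch);
        return string.Format("U+{0:X4} {1} IsSeparator={2}",
            (int)ch, category, char.IsSeparator(ch));
    }
}
```

For example, Describe(' ') gives "U+0020 SpaceSeparator IsSeparator=True", while Describe(',') gives "U+002C OtherPunctuation IsSeparator=False". Note that Describe('\t') gives "U+0009 Control IsSeparator=False": tabs and line breaks such as '\n' are classified as Control, so Char.IsSeparator does not report them as separators.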
In addition, I check each character's numeric code and treat the character as a separator if that code is greater than 2500. This number is a threshold that falls before the huge set of East Asian characters.
The third step is to merge the two lists of separators and use the result to split the text into its words.
As it stands, this method works for all languages except East Asian ones, for which it doesn’t split the text. Instead, it excludes text in these languages from its result. Many of the existing implementations return words and sentences from these languages as long string values.
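The effect of the threshold can be seen in isolation with the sketch below. It applies only the per-character rule (Char.IsSeparator or a code above 2500) and leaves out the constant delimiter string. The ThresholdDemo class and its member names are hypothetical, introduced only for this illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class ThresholdDemo
{
    // Hypothetical helper mirroring the per-character rule described above.
    public static bool IsDynamicSeparator(char ch)
    {
        return char.IsSeparator(ch) || ch > 2500;
    }

    // Collects runs of characters that are not dynamic separators.
    public static List<string> Words(string text)
    {
        var words = new List<string>();
        var current = new StringBuilder();
        foreach (char ch in text ?? string.Empty)
        {
            if (IsDynamicSeparator(ch))
            {
                // A separator ends the current word, if there is one.
                if (current.Length > 0)
                {
                    words.Add(current.ToString());
                    current.Clear();
                }
            }
            else
            {
                current.Append(ch);
            }
        }
        if (current.Length > 0)
            words.Add(current.ToString());
        return words;
    }
}
```

With this rule, Words("hello 世界 world") returns only "hello" and "world", because both Chinese characters have codes above 2500 and are consumed as separators. The decimal value 2500 corresponds to U+09C4, well below the CJK ideographs that start at U+4E00. Other scripts with higher codes are therefore dropped as well: Words("สวัสดี"), a Thai word, returns an empty list.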
Now I can use the code below to test my code snippet:
private void btnSplit_Click(object sender, EventArgs e)
{
    var words = SplitIntoWords(this.txtText.Text);
    this.txtWords.Text = string.Join("\n", words);
}
And it gives me my desired output:
I may write a follow-up about East Asian languages with some details that may be nice to know.
Splitting Text into Words using C#
Sep 05, 2008 12:41 PM
#
You've been kicked (a good thing) - Trackback from DotNetKicks.com
Cohen
Sep 05, 2008 1:34 PM
#
Why not use a regex? With new Regex(@"\w+").Matches(text) you catch all words. Or am I missing the point somewhere?
Keyvan Nayyeri
Sep 05, 2008 1:38 PM
#
@Cohen:
Yes, the point is East Asian languages. There would be some other exceptions as well.
Mikael Lundin
Sep 06, 2008 2:03 AM
#
But doesn't Cohen have a point, though? Shouldn't we define "What is a word in this culture?" rather than "What is a delimiter for words in all cultures?"
While a word in English may contain a ’-sign, a word in Swedish would never do so. The same should be applicable to East Asian languages(?)
I'm not sure if there is one method to rule them all, but I would prefer specifying one "word definition" per language if I were to write a multi-language application and had to parse words out of strings.
Dew Drop - September 6, 2008 | Alvin Ashcraft's Morning Dew
Sep 06, 2008 10:31 AM
#
Pingback from Dew Drop - September 6, 2008 | Alvin Ashcraft's Morning Dew
Weekly Link Post 58 « Rhonda Tipton’s WebLog
Sep 07, 2008 5:54 PM
#
Pingback from Weekly Link Post 58 « Rhonda Tipton’s WebLog
ibrahim dursun
Sep 08, 2008 4:05 AM
#
Here is a one-liner:
return !string.IsNullOrEmpty(text) ? text.Split(text.Where(c => Char.IsSeparator(c) || Convert.ToInt32(c) > 2500).Union(@" ,.:;~!@#$%^&*(){}\/[]<>|'?؟-_+،""=").ToArray(), StringSplitOptions.RemoveEmptyEntries) : null;
Keyvan Nayyeri
Sep 08, 2008 6:52 AM
#
@Ibrahim:
Ok, what do you want to prove? Couldn't you keep your knowledge private?
ibrahim dursun
Sep 09, 2008 12:26 AM
#
No, I cannot keep my knowledge private. I wanted to point out that this post is so worthless that it made me sick.
Keyvan Nayyeri
Sep 09, 2008 6:53 AM
#
@Ibrahim:
So stay sick and don't bother others with your silly comments.
Text Searching
Oct 14, 2008 7:11 AM
#
Text Searching
kanasr
Mar 01, 2009 7:18 PM
#
Hello everyone,
It is a nice subject. I really need this tool, but I don't have any idea
about Visual Studio. Can someone send it to me, please?
bill
Mar 31, 2009 4:03 PM
#
Does anyone have the code for the form shown above?
John Haugeland
Jul 13, 2009 7:27 PM
#
Unfortunately, Cohen is right: this mechanism produces incorrect results for more than half of Western scripts, including Polish and French, as well as basically every non-Western script. When taken to foreign scripts - not just Asian ones - the results are fundamentally broken. This misses essentially every set of foreign punctuation outside most (not all!) Western European scripts. This fails for Cyrillic, Hindi, Thai, anything from the Americas, et cetera.
There is a correct approach explained in UAX #24. An implementation of said correct approach will be in ScUtil soon (parts of it already are), at http://scutil.com/ .