python - String preprocessing
I'm dealing with a list of strings that may contain additional letters beyond their original spelling, for example:
words = ['whyyyyyy', 'heyyyy', 'alrighttttt', 'cool', 'mmmmonday']
I want to pre-process these strings so that they are spelt correctly, and retrieve a new list:
cleaned_words = ['why', 'hey', 'alright', 'cool', 'monday']
The length of the run of duplicated letters can vary. However, cool should keep its spelling.
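To illustrate why this is not a plain de-duplication, here is a minimal sketch (the helper name collapse_runs is made up for this example) that collapses every run of a repeated letter to a single letter. It fixes the stretched words but also breaks cool:

```python
import re

def collapse_runs(word):
    # Replace any run of the same character with a single copy of it.
    return re.sub(r'(.)\1+', r'\1', word)

words = ['whyyyyyy', 'heyyyy', 'alrighttttt', 'cool', 'mmmmonday']
print([collapse_runs(w) for w in words])
# ['why', 'hey', 'alright', 'col', 'monday']
```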
I'm unaware of any Python libraries that do this, and I'd preferably like to try and avoid hard-coding it.
I've tried this: http://norvig.com/spell-correct.html but the more words you put in the text file, the more chance there seems to be of it suggesting an incorrect spelling, so it's never actually getting it right, even for words without any additional letters. For example, eel becomes teel
...
Thanks in advance.
If you download a text file of English words to check against, here is one way that could work.
I've not tested it, but you get the idea. It iterates through the letters, and if the current letter matches the last one, it removes that letter from the word. If it narrows those letters down to 1 and there is still no valid word, it resets the word back to normal and continues until the next duplicate characters are found.
from urllib.request import urlopen

words = ['whyyyyyy', 'heyyyy', 'alrighttttt', 'cool', 'mmmmonday']

# Download the word list and keep it as a lowercase set for fast lookups
url = 'https://raw.githubusercontent.com/eneko/data-repository/master/data/words.txt'
word_list = set(w.lower() for w in urlopen(url).read().decode('utf-8').split('\n'))

found_words = []
for word in (w.lower() for w in words):

    # Word is already valid, so keep it unchanged
    if word in word_list:
        found_words.append(word)
        continue

    last_char = None
    i = 0
    current_word = word
    while i < len(current_word):

        # Duplicate character: remove it from the word
        if current_word[i] == last_char:
            current_word = current_word[:i] + current_word[i + 1:]

        # Reset the word if there are no more duplicate characters
        else:
            current_word = word
            last_char = current_word[i]
            i += 1

        # Word has been found
        if current_word in word_list:
            found_words.append(current_word)
            break

print(found_words)
# ['why', 'hey', 'alright', 'cool', 'monday']
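The same idea can also be sketched without the download, which makes it easier to try out. This is only an illustration under assumptions: KNOWN_WORDS is a tiny stand-in for a real word list, and the helper names candidates and clean are invented here. Instead of trimming one letter at a time, it builds every spelling you get by shortening each run of repeated letters, then returns the longest one found in the word list.

```python
from itertools import groupby, product

# Stand-in dictionary for this sketch; a real word list would be much larger.
KNOWN_WORDS = {'why', 'hey', 'alright', 'cool', 'monday'}

def candidates(word):
    # Split the word into runs of identical letters, e.g. 'heyy' -> h, e, yy.
    runs = [(ch, len(list(group))) for ch, group in groupby(word)]
    # Each run may be shortened to any length from its full size down to 1.
    options = [[ch * n for n in range(length, 0, -1)] for ch, length in runs]
    return [''.join(parts) for parts in product(*options)]

def clean(word, known=KNOWN_WORDS):
    word = word.lower()
    # Try longer candidates first so a valid word like 'cool' is left alone.
    for candidate in sorted(candidates(word), key=len, reverse=True):
        if candidate in known:
            return candidate
    return word  # no valid spelling found, leave the word unchanged

words = ['whyyyyyy', 'heyyyy', 'alrighttttt', 'cool', 'mmmmonday']
print([clean(w) for w in words])
# ['why', 'hey', 'alright', 'cool', 'monday']
```

Unlike the loop above, this handles words with more than one stretched run, but the number of candidates is the product of the run lengths, so it is only sensible for short inputs.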