Python - String preprocessing


I'm dealing with a list of strings that may contain additional letters beyond their original spelling, for example:

words = ['whyyyyyy', 'heyyyy', 'alrighttttt', 'cool', 'mmmmonday'] 

I want to pre-process these strings so that they are spelt correctly, and retrieve a new list:

cleaned_words = ['why', 'hey', 'alright', 'cool', 'monday'] 

The length of the run of duplicated letters can vary; however, cool should keep its spelling.

I'm unaware of any Python libraries that do this, and I'd prefer to avoid hard-coding it.

I've tried this: http://norvig.com/spell-correct.html, but the more words I put in the text file, the more likely it seems to suggest an incorrect spelling, so it never actually gets it right, even without the additional letters removed. For example, eel becomes teel...
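One likely reason an edit-distance corrector struggles with these inputs: the linked article only considers candidates within two edits, while 'whyyyyyy' is five deletions away from 'why'. A possible pre-processing step (my assumption, not something from the original post) is to squeeze every run of three or more identical letters down to two before handing the word to the corrector, since English words very rarely repeat a letter three times in a row. The helper name squeeze_runs is hypothetical.

```python
import re

# Matches any character followed by two or more repeats of itself,
# i.e. a run of three or more identical characters.
_LONG_RUN = re.compile(r'(.)\1{2,}')

def squeeze_runs(word):
    """Shorten every run of 3+ identical characters to exactly two."""
    return _LONG_RUN.sub(r'\1\1', word)

words = ['whyyyyyy', 'heyyyy', 'alrighttttt', 'cool', 'mmmmonday']
squeezed = [squeeze_runs(w) for w in words]
print(squeezed)
# ['whyy', 'heyy', 'alrightt', 'cool', 'mmonday']
```

Every squeezed word is now at most one deletion away from its target, and 'cool' is untouched because its run has only two letters. The output still needs a dictionary check or a spell corrector to finish the job.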

Thanks in advance.

If you download a text file of English words to check against, this is one way that could work.

I've not tested it, but you get the idea. It iterates through the letters, and if the current letter matches the last one, it removes that letter from the word. If it narrows a run of letters down to one and there is still no valid word, it resets the word back to normal and continues until the next duplicate characters are found.

# Python 3: urllib.request replaces Python 2's urllib2
import urllib.request

words = ['whyyyyyy', 'heyyyy', 'alrighttttt', 'cool', 'mmmmonday']

# Download the word list and store it as a lowercase set for fast lookups
url = 'https://raw.githubusercontent.com/eneko/data-repository/master/data/words.txt'
raw = urllib.request.urlopen(url).read().decode('utf-8')
word_list = set(line.strip().lower() for line in raw.splitlines())

found_words = []
for word in (w.lower() for w in words):

    # Keep the word as it is if it is already valid
    if word in word_list:
        found_words.append(word)
        continue

    last_char = None
    i = 0
    current_word = word
    while i < len(current_word):

        # Duplicate of the previous character: remove it
        if current_word[i] == last_char:
            current_word = current_word[:i] + current_word[i + 1:]

        # No more duplicates in this run: reset the word and move on.
        # last_char must be read before i is advanced.
        else:
            current_word = word
            last_char = current_word[i]
            i += 1

        # Word has been found
        if current_word in word_list:
            found_words.append(current_word)
            break

print(found_words)
#['why', 'hey', 'alright', 'cool', 'monday']
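The loop above shortens one run of duplicates at a time, so a word with several stretched runs (say 'heeellooo') would not be recovered. A possible variation, offered only as a sketch, is to generate every spelling in which each run is kept at some length between its original length and one, and return the first that appears in the word list. The names candidates and clean are hypothetical, and the small word_list set is a stand-in for the downloaded file.

```python
from itertools import groupby, product

def candidates(word):
    """Yield every spelling obtained by shortening runs of repeated letters.

    Each run of n identical letters may be kept at any length from n down
    to 1. Longer variants come first, so the unchanged word is tried first.
    """
    runs = [(char, len(list(group))) for char, group in groupby(word)]
    options = [[char * n for n in range(length, 0, -1)] for char, length in runs]
    for parts in product(*options):
        yield ''.join(parts)

def clean(word, word_list):
    """Return the first candidate found in word_list, or the word unchanged."""
    word = word.lower()
    return next((c for c in candidates(word) if c in word_list), word)

# Hypothetical stand-in for the downloaded English word list
word_list = {'why', 'hey', 'alright', 'cool', 'col', 'monday', 'hello'}
words = ['whyyyyyy', 'heyyyy', 'alrighttttt', 'cool', 'mmmmonday', 'heeellooo']
print([clean(w, word_list) for w in words])
# ['why', 'hey', 'alright', 'cool', 'monday', 'hello']
```

Because the unchanged word is the first candidate, 'cool' stays 'cool' even though 'col' is also in the set. The number of candidates is the product of the run lengths, which is fine for single words but could grow quickly for very long stretched strings.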
