Python: Tesseract results give unwanted extra line gaps between sentences


I am performing an OCR operation with Tesseract and have written a simple Python wrapper for that. The problem is that I am getting unwanted line gaps between sentences in the final text file, which I need to remove programmatically. For example:

    1 tbsp peanut or corn oil, plus little cooking scallops

    2 tbsp bottled mild or medium Thai green curry paste
    2 tbsp water

    2 tsp light soy sauce

Please note the line gaps, which I need to remove. Please share any tips if you have experienced similar problems. Thank you.

Here is the wrapper:

    from PIL import Image
    import subprocess
    import os
    from wand.image import Image
    import markdown2
    from textblob import TextBlob

    import util
    import errors

    tesseract_exe = "tesseract"  # name of the executable to be called at the command line
    scratch_text_name_root = "temp"  # leave out the .txt extension
    cleanup_scratch_flag = True  # temporary files are cleaned up after the OCR operation
    pagesegmode = "-psm 0"


    def call_tesseract(input_file, output_file):
        args = [tesseract_exe, input_file, output_file, pagesegmode]
        proc = subprocess.Popen(args)
        retcode = proc.wait()
        if retcode != 0:
            errors.check_for_errors()


    def retrieve_text(scratch_text_name_root):
        inf = file(scratch_text_name_root + '.txt')
        text = inf.read()
        inf.close()
        return text


    def write_to_file(filename, string):
        file = open(filename, 'w')
        file.write(string)
        file.close()


    def image_to_string(filename):
        try:
            call_tesseract(filename, scratch_text_name_root)
            text = retrieve_text(scratch_text_name_root)
        finally:
            try:
                os.remove(scratch_text_name_root)
            except OSError:
                pass

            return text


    filename = "book/0001.bin.png"
    text = image_to_string(filename)
    print "writing file"
    write_to_file("0002.bin.txt", text)

I'm not sure why Tesseract gives you these empty lines, but maybe a simple workaround is enough for you:

Just remove these empty lines. There are many ways to do this, for example here: https://stackoverflow.com/a/3711884/4175009

or here:

https://stackoverflow.com/a/2369474/4175009

These solutions both suppose that you read the file line by line.

I like this solution because you can use it directly on the finished string, and it handles OS differences in line endings (\n, \n\r, \r\n).
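
As a rough illustration of the string-based approach, here is a minimal sketch. It is not taken from any of the linked answers; the function name `remove_blank_lines` is made up for this example, and it assumes that lines containing only whitespace should be dropped as well.

```python
def remove_blank_lines(text):
    """Return text without empty or whitespace-only lines.

    Hypothetical helper for illustration; not part of the wrapper above.
    """
    # str.splitlines() recognises \n, \r\n and \r, so mixed line endings
    # are split correctly regardless of the operating system.
    kept = [line for line in text.splitlines() if line.strip()]
    # Re-join with a single "\n"; the original line-ending style is not kept.
    return "\n".join(kept)


if __name__ == "__main__":
    sample = "1 tbsp oil\n\n2 tbsp paste\r\n\r\n2 tbsp water\n"
    print(remove_blank_lines(sample))
```

In the wrapper from the question, this could be applied to the string returned by `image_to_string` before it is passed to `write_to_file`.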

