Python Tesseract results give unwanted extra line gaps between sentences
I am performing an OCR operation with Tesseract and have written a simple Python wrapper for that. The problem is that I am getting unwanted line gaps between sentences in the resulting text file, which I need to remove programmatically. Example:
1 tbsp peanut or corn oil, plus little cooking scallops

2 tbsp bottled mild or medium thai green curry paste

2 tbsp water

2 tsp light soy sauce
Please note the line gaps, which I need to remove. Please share any tips if you have experienced similar problems. Thank you.
Here is the wrapper:
    from PIL import Image
    import subprocess
    import os
    from wand.image import Image
    import markdown2
    from textblob import TextBlob
    import util
    import errors

    tesseract_exe = "tesseract"  # name of the executable called at the command line
    scratch_text_name_root = "temp"  # leave out the .txt extension
    cleanup_scratch_flag = True  # temporary files are cleaned up after the OCR operation
    pagesegmode = "-psm 0"  # page segmentation option passed to tesseract

    def call_tesseract(input_file, output_file):
        # Run tesseract; it writes its result to output_file + '.txt'.
        args = [tesseract_exe, input_file, output_file, pagesegmode]
        proc = subprocess.Popen(args)
        retcode = proc.wait()
        if retcode != 0:
            errors.check_for_errors()

    def retrieve_text(scratch_text_name_root):
        # Read back the text file that tesseract produced.
        with open(scratch_text_name_root + '.txt') as inf:
            return inf.read()

    def write_to_file(filename, string):
        with open(filename, 'w') as out:
            out.write(string)

    def image_to_string(filename):
        try:
            call_tesseract(filename, scratch_text_name_root)
            text = retrieve_text(scratch_text_name_root)
        finally:
            try:
                # The scratch file carries the .txt extension that tesseract appends.
                os.remove(scratch_text_name_root + '.txt')
            except OSError:
                pass
        return text

    filename = "book/0001.bin.png"
    text = image_to_string(filename)
    print("writing file")
    write_to_file("0002.bin.txt", text)
I'm not sure why Tesseract gives you these empty lines, but maybe a simple workaround will do for you:
Just remove these empty lines. There are many ways to do this, for example here: https://stackoverflow.com/a/3711884/4175009
or here:
https://stackoverflow.com/a/2369474/4175009
Both of these solutions assume that you read the file line by line.
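A minimal sketch of the line-by-line idea, assuming the OCR output has already been written to a text file. The function name `remove_blank_lines_in_file` and its two path parameters are made up for illustration and are not taken from the linked answers:

```python
def remove_blank_lines_in_file(src_path, dst_path):
    """Copy src_path to dst_path, skipping empty and whitespace-only lines."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            # line.strip() is an empty string for blank or whitespace-only lines
            if line.strip():
                dst.write(line)
```

With the wrapper from the question, this would be called on the written file, for example `remove_blank_lines_in_file("0002.bin.txt", "0002.clean.txt")`, where the second file name is again only an example.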
I prefer a solution that you can use directly on the finished string and that handles OS differences in line endings (\n, \n\r, \r\n).
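One way to work directly on the finished string could look like the sketch below. It relies on `str.splitlines()`, which treats `\n`, `\r\n` and `\r` (among other line boundaries) as line breaks. The function name `remove_blank_lines` is hypothetical, and joining with `"\n"` is an assumption; use `os.linesep` instead if you need the platform's native line ending.

```python
def remove_blank_lines(text):
    """Return text with empty and whitespace-only lines dropped."""
    # splitlines() handles \n, \r\n and \r alike, so the input may come
    # from any platform; the line terminators themselves are discarded.
    kept = [line for line in text.splitlines() if line.strip()]
    # Assumption: the cleaned text should use "\n" between lines.
    return "\n".join(kept)


if __name__ == "__main__":
    ocr_text = "2 tbsp water\n\n2 tsp light soy sauce\r\n\r\n"
    print(remove_blank_lines(ocr_text))
```

In the wrapper from the question, this could be applied to the result of `image_to_string(filename)` before calling `write_to_file`.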