Can't get Spark to work with IPython Notebook on Windows
I have installed Spark on a Windows 10 box, and the installation works fine from the PySpark console. I have tried to configure IPython Notebook to work with this Spark installation and have added the following to my notebook:
import os
import sys

os.environ['SPARK_HOME'] = "E:/Spark/spark-1.6.0-bin-hadoop2.6"
sys.path.append("E:/Spark/spark-1.6.0-bin-hadoop2.6/bin")
sys.path.append("E:/Spark/spark-1.6.0-bin-hadoop2.6/python")
sys.path.append("E:/Spark/spark-1.6.0-bin-hadoop2.6/python/pyspark")
sys.path.append("E:/Spark/spark-1.6.0-bin-hadoop2.6/python/lib")
sys.path.append("E:/Spark/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip")
sys.path.append("E:/Spark/spark-1.6.0-bin-hadoop2.6/python/lib/py4j-0.9-src.zip")
sys.path.append("C:/Program Files/Java/jdk1.8.0_51/bin")
This works fine for creating a SparkContext and running code like
sc.parallelize([1, 2, 3])
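(For reference, the SparkContext itself is created in the usual way once the paths above are appended; a minimal sketch, where the local master and app name are illustrative assumptions rather than the exact values used:)

from pyspark import SparkConf, SparkContext

# Assumed local-mode setup for the notebook; master and app name are illustrative.
conf = SparkConf().setMaster("local[*]").setAppName("notebook-test")
sc = SparkContext(conf=conf)

print(sc.parallelize([1, 2, 3]).collect())  # -> [1, 2, 3]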
But when I write the following
file = sc.textFile("E:/scripts.sql")
words = sc.count()
I get the following error:
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-22-3c172daac960> in <module>()
      1 file = sc.textFile("E:/scripts.sql")
----> 2 file.count()

E:/Spark/spark-1.6.0-bin-hadoop2.6/python\pyspark\rdd.py in count(self)
   1002         3
   1003         """
-> 1004         return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
   1005
   1006     def stats(self):

E:/Spark/spark-1.6.0-bin-hadoop2.6/python\pyspark\rdd.py in sum(self)
    993         6.0
    994         """
--> 995         return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
    996
    997     def count(self):

E:/Spark/spark-1.6.0-bin-hadoop2.6/python\pyspark\rdd.py in fold(self, zeroValue, op)
    867         # zeroValue provided to each partition is unique from the one provided
    868         # to the final reduce call
--> 869         vals = self.mapPartitions(func).collect()
    870         return reduce(op, vals, zeroValue)
    871

E:/Spark/spark-1.6.0-bin-hadoop2.6/python\pyspark\rdd.py in collect(self)
    769         """
    770         with SCCallSiteSync(self.context) as css:
--> 771             port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
    772         return list(_load_from_socket(port, self._jrdd_deserializer))
    773

E:\Spark\spark-1.6.0-bin-hadoop2.6\python\lib\py4j-0.9-src.zip\py4j\java_gateway.py in __call__(self, *args)
    811         answer = self.gateway_client.send_command(command)
    812         return_value = get_return_value(
--> 813             answer, self.gateway_client, self.target_id, self.name)
    814
    815         for temp_arg in temp_args:

E:\Spark\spark-1.6.0-bin-hadoop2.6\python\lib\py4j-0.9-src.zip\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
    306                 raise Py4JJavaError(
    307                     "An error occurred while calling {0}{1}{2}.\n".
--> 308                     format(target_id, ".", name), value)
    309             else:
    310                 raise Py4JError(

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8.0 (TID 8, localhost): org.apache.spark.SparkException: Python worker did not connect back in time
    at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:136)
    at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:65)
    at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:134)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:101)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
Caused by: java.net.SocketTimeoutException: Accept timed out
    at java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method)
    at java.net.DualStackPlainSocketImpl.socketAccept(Unknown Source)
    at java.net.AbstractPlainSocketImpl.accept(Unknown Source)
    at java.net.PlainSocketImpl.accept(Unknown Source)
    at java.net.ServerSocket.implAccept(Unknown Source)
    at java.net.ServerSocket.accept(Unknown Source)
    at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:131)
    ... 12 more

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:927)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:926)
    at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:405)
    at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:209)
    at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.spark.SparkException: Python worker did not connect back in time
    at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:136)
    at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:65)
    at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:134)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:101)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    ... 1 more
Caused by: java.net.SocketTimeoutException: Accept timed out
    at java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method)
    at java.net.DualStackPlainSocketImpl.socketAccept(Unknown Source)
    at java.net.AbstractPlainSocketImpl.accept(Unknown Source)
    at java.net.PlainSocketImpl.accept(Unknown Source)
    at java.net.ServerSocket.implAccept(Unknown Source)
    at java.net.ServerSocket.accept(Unknown Source)
    at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:131)
    ... 12 more
Please help me resolve this, as I am on a tight project timeline.
Try escaping the backslashes:
file = sc.textFile("E:\\scripts.sql")
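Forward slashes, escaped backslashes, and raw strings are all equivalent ways to spell a Windows path here; for example (illustrative only):

file = sc.textFile("E:/scripts.sql")     # forward slashes
file = sc.textFile("E:\\scripts.sql")    # escaped backslashes
file = sc.textFile(r"E:\scripts.sql")    # raw string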
Edited to add a second item:
Also, notice that you called:
words = sc.count()
Try this instead, which worked on my Windows 10 install:
file = sc.textFile("E:/scripts.sql")
words = file.count()
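As a side note, RDD.count() counts elements, which for textFile means lines rather than words; if an actual word count is wanted, a small sketch (variable names are mine) could be:

lines = sc.textFile("E:/scripts.sql")
line_count = lines.count()  # number of lines in the file

# Split each line on whitespace to count individual words instead.
word_count = lines.flatMap(lambda line: line.split()).count()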