Can't get Spark to work with IPython Notebook on Windows


I have installed Spark on a Windows 10 box, and the installation works fine from the PySpark console. I have tried to configure IPython Notebook to work with the Spark installation, and have made the following imports and path settings:

import os
import sys

os.environ['SPARK_HOME'] = "E:/spark/spark-1.6.0-bin-hadoop2.6"
sys.path.append("E:/spark/spark-1.6.0-bin-hadoop2.6/bin")
sys.path.append("E:/spark/spark-1.6.0-bin-hadoop2.6/python")
sys.path.append("E:/spark/spark-1.6.0-bin-hadoop2.6/python/pyspark")
sys.path.append("E:/spark/spark-1.6.0-bin-hadoop2.6/python/lib")
sys.path.append("E:/spark/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip")
sys.path.append("E:/spark/spark-1.6.0-bin-hadoop2.6/python/lib/py4j-0.9-src.zip")
sys.path.append("C:/Program Files/Java/jdk1.8.0_51/bin")
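(For completeness, the SparkContext mentioned below is then built in the notebook itself; a minimal sketch, assuming a local master, since the master URL and app name are not stated in the original post:)

from pyspark import SparkConf, SparkContext

# Run locally using all available cores; the application name is arbitrary.
conf = SparkConf().setMaster("local[*]").setAppName("notebook")
sc = SparkContext(conf=conf)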

This works fine for creating a SparkContext, and code like

sc.parallelize([1, 2, 3]) 
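(Note that sc.parallelize on its own is lazy: no Python worker process is launched until an action runs. A minimal check that actually exercises the worker, not part of the original post, would be something like:)

rdd = sc.parallelize([1, 2, 3])
rdd.count()  # an action like count() is what actually starts the Python worker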

But when I write the following

file = sc.textFile("E:/scripts.sql")
words = sc.count()

I get the following error:

Py4JJavaError                             Traceback (most recent call last)
<ipython-input-22-3c172daac960> in <module>()
      1 file = sc.textFile("E:/scripts.sql")
----> 2 file.count()

E:/spark/spark-1.6.0-bin-hadoop2.6/python\pyspark\rdd.py in count(self)
   1002         3
   1003         """
-> 1004         return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
   1005
   1006     def stats(self):

E:/spark/spark-1.6.0-bin-hadoop2.6/python\pyspark\rdd.py in sum(self)
    993         6.0
    994         """
--> 995         return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
    996
    997     def count(self):

E:/spark/spark-1.6.0-bin-hadoop2.6/python\pyspark\rdd.py in fold(self, zeroValue, op)
    867         # zeroValue provided to each partition is unique from the one provided
    868         # to the final reduce call
--> 869         vals = self.mapPartitions(func).collect()
    870         return reduce(op, vals, zeroValue)
    871

E:/spark/spark-1.6.0-bin-hadoop2.6/python\pyspark\rdd.py in collect(self)
    769         """
    770         with SCCallSiteSync(self.context) as css:
--> 771             port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
    772         return list(_load_from_socket(port, self._jrdd_deserializer))
    773

E:\spark\spark-1.6.0-bin-hadoop2.6\python\lib\py4j-0.9-src.zip\py4j\java_gateway.py in __call__(self, *args)
    811         answer = self.gateway_client.send_command(command)
    812         return_value = get_return_value(
--> 813             answer, self.gateway_client, self.target_id, self.name)
    814
    815         for temp_arg in temp_args:

E:\spark\spark-1.6.0-bin-hadoop2.6\python\lib\py4j-0.9-src.zip\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
    306                 raise Py4JJavaError(
    307                     "An error occurred while calling {0}{1}{2}.\n".
--> 308                     format(target_id, ".", name), value)
    309             else:
    310                 raise Py4JError(

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8.0 (TID 8, localhost): org.apache.spark.SparkException: Python worker did not connect back in time
	at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:136)
	at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:65)
	at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:134)
	at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:101)
	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
	at org.apache.spark.scheduler.Task.run(Task.scala:89)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.lang.Thread.run(Unknown Source)
Caused by: java.net.SocketTimeoutException: Accept timed out
	at java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method)
	at java.net.DualStackPlainSocketImpl.socketAccept(Unknown Source)
	at java.net.AbstractPlainSocketImpl.accept(Unknown Source)
	at java.net.PlainSocketImpl.accept(Unknown Source)
	at java.net.ServerSocket.implAccept(Unknown Source)
	at java.net.ServerSocket.accept(Unknown Source)
	at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:131)
	... 12 more

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
	at scala.Option.foreach(Option.scala:236)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:927)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:926)
	at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:405)
	at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
	at java.lang.reflect.Method.invoke(Unknown Source)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
	at py4j.Gateway.invoke(Gateway.java:259)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:209)
	at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.spark.SparkException: Python worker did not connect back in time
	at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:136)
	at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:65)
	at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:134)
	at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:101)
	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
	at org.apache.spark.scheduler.Task.run(Task.scala:89)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	... 1 more
Caused by: java.net.SocketTimeoutException: Accept timed out
	at java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method)
	at java.net.DualStackPlainSocketImpl.socketAccept(Unknown Source)
	at java.net.AbstractPlainSocketImpl.accept(Unknown Source)
	at java.net.PlainSocketImpl.accept(Unknown Source)
	at java.net.ServerSocket.implAccept(Unknown Source)
	at java.net.ServerSocket.accept(Unknown Source)
	at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:131)
	... 12 more

Please help me resolve this soon; I am on a short-timeline project.

Try escaping the backslashes:

file = sc.textFile("E:\\scripts.sql")
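(Equivalently, a raw string avoids the double backslashes; the path is just the one from the question:)

file = sc.textFile(r"E:\scripts.sql")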

Edited to add a second item:

Also, I notice you called:

words = sc.count() 

Try this instead; it worked on my Windows 10 install:

file = sc.textFile("E:/scripts.sql")
words = file.count()
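As a purely illustrative follow-up: count() on the RDD above counts lines, so if the name `words` is meant to be an actual word count, a typical continuation might look like this (the whitespace-splitting logic is an assumption, not part of the original answer):

lines = sc.textFile("E:/scripts.sql")
# Split each line on whitespace and count the resulting tokens.
word_count = lines.flatMap(lambda line: line.split()).count()
print(word_count)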
