amazon s3 - AWS EMR Spark save to S3 is very slow
I have a Spark job running on EMR that takes an unusually long time. The Spark tasks themselves run fast, but when saving the result to S3 the job spends more than 20 minutes doing this:
16/02/05 01:44:44 info latency: statuscode=[404], exception=[com.amazonaws.services.s3.model.amazons3exception: not found (service: amazon s3; status code: 404; error code: 404 not found; request id: 561ca7cd8c009e79), s3 extended request id: b3dmnykxe/qszsd1vrebf5fr+uh8m5k2tb8zz+y0+vfgqfswrjjpewv7wx61+9zijhy5nf35rx8=], servicename=[amazon s3], awserrorcode=[404 not found], awsrequestid=[561ca7cd8c009e79], serviceendpoint=[https://my-bucket-name.s3.amazonaws.com], exception=1, httpclientpoolleasedcount=0, requestcount=1, httpclientpoolpendingcount=0, httpclientpoolavailablecount=1, clientexecutetime=[12.766], httprequesttime=[12.494], httpclientreceiveresponsetime=[11.067], requestsigningtime=[0.103], credentialsrequesttime=[0.001], httpclientsendrequesttime=[0.071], 16/02/05 01:44:44 info latency: statuscode=[200], servicename=[amazon s3], awsrequestid=[f84316d0c1958276], serviceendpoint=[https://my-bucket-name.s3.amazonaws.com], httpclientpoolleasedcount=0, requestcount=1, httpclientpoolpendingcount=0, httpclientpoolavailablecount=1, clientexecutetime=[16.001], httprequesttime=[13.1], httpclientreceiveresponsetime=[11.69], requestsigningtime=[0.085], credentialsrequesttime=[0.001], responseprocessingtime=[2.673], httpclientsendrequesttime=[0.071], 16/02/05 01:44:44 info s3nativefilesystem: rename s3://my-bucket-name/stati/data/output/bidder4/_temporary/0/task_201602050130_0014_m_000001/organization_id=100932/impression_date=2016-01-01/part-r-00001-0e84d8cb-4b43-4cc3-b95e-65b1b9c12f25.gz.parquet s3://my-bucket-name/stati/data/output/bidder4/organization_id=100932/impression_date=2016-01-01/part-r-00001-0e84d8cb-4b43-4cc3-b95e-65b1b9c12f25.gz.parquet 16/02/05 01:44:44 info latency: statuscode=[404], exception=[com.amazonaws.services.s3.model.amazons3exception: not found (service: amazon s3; status code: 404; error code: 404 not found; request id: 014934f9c27e2969), s3 extended request id: 
b313czevyzr21sbpxhodqs4gcrudu249jd5+z+d0a4fglhw6eqx0/grnttkrs2y4ucknd8dywyg=], servicename=[amazon s3], awserrorcode=[404 not found], awsrequestid=[014934f9c27e2969], serviceendpoint=[https://my-bucket-name.s3.amazonaws.com], exception=1, httpclientpoolleasedcount=0, requestcount=1, httpclientpoolpendingcount=0, httpclientpoolavailablecount=1, clientexecutetime=[11.854], httprequesttime=[11.598], httpclientreceiveresponsetime=[10.168], requestsigningtime=[0.098], credentialsrequesttime=[0.001], httpclientsendrequesttime=[0.078], 16/02/05 01:44:44 info latency: statuscode=[404], exception=[com.amazonaws.services.s3.model.amazons3exception: not found (service: amazon s3; status code: 404; error code: 404 not found; request id: 97fd09be9e109d68), s3 extended request id: ogopbseyzf9/7octzwyok+lcfalplbw+ioafxiybkshdtvmuyzeffogi7+qba6fo0rev1sl9fl4=], servicename=[amazon s3], awserrorcode=[404 not found], awsrequestid=[97fd09be9e109d68], serviceendpoint=[https://my-bucket-name.s3.amazonaws.com], exception=1, httpclientpoolleasedcount=0, requestcount=1, httpclientpoolpendingcount=0, httpclientpoolavailablecount=1, clientexecutetime=[13.141], httprequesttime=[12.864], httpclientreceiveresponsetime=[11.462], requestsigningtime=[0.098], credentialsrequesttime=[0.001], httpclientsendrequesttime=[0.057], 16/02/05 01:51:13 info latency: statuscode=[200], servicename=[amazon s3], awsrequestid=[7936d2099dd2eb95], serviceendpoint=[https://my-bucket-name.s3.amazonaws.com], httpclientpoolleasedcount=0, requestcount=1, httpclientpoolpendingcount=0, httpclientpoolavailablecount=1, clientexecutetime=[8.471], httprequesttime=[8.209], httpclientreceiveresponsetime=[6.947], requestsigningtime=[0.09], credentialsrequesttime=[0.001], responseprocessingtime=[0.08], httpclientsendrequesttime=[0.042], 16/02/05 01:51:13 info s3nativefilesystem: liststatus s3://my-bucket-name/stati/data/output/bidder4/_temporary/0/task_201602050130_0014_m_000004/organization_id=101041 recursive false 16/02/05 
01:51:13 info latency: statuscode=[404], exception=[com.amazonaws.services.s3.model.amazons3exception: not found (service: amazon s3; status code: 404; error code: 404 not found; request id: 4d2baed335e4dd56), s3 extended request id: ngmiu8r7x94wuhnyxhtb4aw0aipq9f1rhbmawsfsh/x8d1/b7efjawgo8z/eluj18pklvm7w2zq=], servicename=[amazon s3], awserrorcode=[404 not found], awsrequestid=[4d2baed335e4dd56], serviceendpoint=[https://my-bucket-name.s3.amazonaws.com], exception=1, httpclientpoolleasedcount=0, requestcount=1, httpclientpoolpendingcount=0, httpclientpoolavailablecount=1, clientexecutetime=[22.181], httprequesttime=[22.004], httpclientreceiveresponsetime=[20.697], requestsigningtime=[0.053], credentialsrequesttime=[0.0], httpclientsendrequesttime=[0.052], 16/02/05 01:51:13 info latency: statuscode=[200], servicename=[amazon s3], awsrequestid=[c554088e2b24a1f0], serviceendpoint=[https://my-bucket-name.s3.amazonaws.com], httpclientpoolleasedcount=0, requestcount=1, httpclientpoolpendingcount=0, httpclientpoolavailablecount=1, clientexecutetime=[35.69], httprequesttime=[34.067], httpclientreceiveresponsetime=[32.718], requestsigningtime=[0.07], credentialsrequesttime=[0.0], responseprocessingtime=[1.447], httpclientsendrequesttime=[0.043], 16/02/05 01:51:14 info latency: statuscode=[404], exception=[com.amazonaws.services.s3.model.amazons3exception: not found (service: amazon s3; status code: 404; error code: 404 not found; request id: 3adae326d46195e2), s3 extended request id: peawu6ey5ngjdmshqqmhvyzqmvhjogefngu2bnash4a5o4qgubyum+tbliz2763pgizot2btaqc=], servicename=[amazon s3], awserrorcode=[404 not found], awsrequestid=[3adae326d46195e2], serviceendpoint=[https://my-bucket-name.s3.amazonaws.com], exception=1, httpclientpoolleasedcount=0, requestcount=1, httpclientpoolpendingcount=0, httpclientpoolavailablecount=1, clientexecutetime=[24.088], httprequesttime=[23.851], httpclientreceiveresponsetime=[22.466], requestsigningtime=[0.088], credentialsrequesttime=[0.0], 
httpclientsendrequesttime=[0.064], 16/02/05 01:51:14 info latency: statuscode=[404], exception=[com.amazonaws.services.s3.model.amazons3exception: not found (service: amazon s3; status code: 404; error code: 404 not found; request id: 069544819617c5f4), s3 extended request id: gomslqka0emliv+uo5zsjrxdhjxqbmvjmqybjmiqozuejppiup20rt/dqjzqrqpggde0dpzcr5q=], servicename=[amazon s3], awserrorcode=[404 not found], awsrequestid=[069544819617c5f4], serviceendpoint=[https://my-bucket-name.s3.amazonaws.com], exception=1, httpclientpoolleasedcount=0, requestcount=1, httpclientpoolpendingcount=0, httpclientpoolavailablecount=1, clientexecutetime=[51.626], httprequesttime=[51.351], httpclientreceiveresponsetime=[49.956], requestsigningtime=[0.081], credentialsrequesttime=[0.0], httpclientsendrequesttime=[0.05], 16/02/05 01:51:14 info latency: statuscode=[200], servicename=[amazon s3], awsrequestid=[e59c345260724310], serviceendpoint=[https://my-bucket-name.s3.amazonaws.com], httpclientpoolleasedcount=0, requestcount=1, httpclientpoolpendingcount=0, httpclientpoolavailablecount=1, clientexecutetime=[14.771], httprequesttime=[13.035], httpclientreceiveresponsetime=[11.65], requestsigningtime=[0.092], credentialsrequesttime=[0.0], responseprocessingtime=[1.533], httpclientsendrequesttime=[0.072], 16/02/05 01:51:14 info s3nativefilesystem: liststatus s3://my-bucket-name/stati/data/output/bidder4/_temporary/0/task_201602050130_0014_m_000004/organization_id=101041/impression_date=2016-01-01 recursive false 16/02/05 01:51:14 info latency: statuscode=[404], exception=[com.amazonaws.services.s3.model.amazons3exception: not found (service: amazon s3; status code: 404; error code: 404 not found; request id: e1f7fdb93ab37e2f), s3 extended request id: tglj240gjvywm2bvi0msk4aah4c5kwk/8l6ujiw/ws/wxrkpeed3mfuax7pzwgvl8esef8ttcz8=], servicename=[amazon s3], awserrorcode=[404 not found], awsrequestid=[e1f7fdb93ab37e2f], serviceendpoint=[https://my-bucket-name.s3.amazonaws.com], exception=1, 
httpclientpoolleasedcount=0, requestcount=1, httpclientpoolpendingcount=0, httpclientpoolavailablecount=1, clientexecutetime=[9.351], httprequesttime=[9.166], httpclientreceiveresponsetime=[7.869], requestsigningtime=[0.071], credentialsrequesttime=[0.0], httpclientsendrequesttime=[0.04], 16/02/05 01:51:14 info latency: statuscode=[200], servicename=[amazon s3], awsrequestid=[2228f32badb3eac6], serviceendpoint=[https://my-bucket-name.s3.amazonaws.com], httpclientpoolleasedcount=0, requestcount=1, httpclientpoolpendingcount=0, httpclientpoolavailablecount=1, clientexecutetime=[21.51], httprequesttime=[19.992], httpclientreceiveresponsetime=[18.687], requestsigningtime=[0.047], credentialsrequesttime=[0.0], responseprocessingtime=[1.387], httpclientsendrequesttime=[0.057], 16/02/05 01:51:14 info latency: statuscode=[404], exception=[com.amazonaws.services.s3.model.amazons3exception: not found (service: amazon s3; status code: 404; error code: 404 not found; request id: daafa5b4b81aab0c), s3 extended request id: 5bgdszg4crvs0kn8s1hwdvpfknwqqzygs+qok0m6+u7k8hj3eupdeeyxmv6zt+dx1cqkngdv+/u=], servicename=[amazon s3], awserrorcode=[404 not found], awsrequestid=[daafa5b4
I thought this was similar to another question, so I've set the following in my config.json file, as suggested there:
{ "classification": "mapred-site", "properties": { "mapred.output.direct.emrfilesystem": "true", "mapred.output.direct.natives3filesystem": "true" } },
I'm still seeing the same behavior. This is on EMR 4.3.0.
The issue is that the results are being uploaded to S3 twice. Take a look at https://github.com/apache/spark/blob/branch-1.6/docs/sql-programming-guide.md. Set the Hadoop property spark.sql.parquet.output.committer.class to org.apache.spark.sql.parquet.DirectParquetOutputCommitter. Note the info there regarding the impact on speculative execution.
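As a rough illustration of the suggestion above, here is a minimal PySpark-style sketch that sets the property on the Hadoop configuration before the write. The helper name `use_direct_parquet_committer` is hypothetical, and the property name and class are taken as given in the answer for Spark 1.6; other Spark versions may not ship this class under that package, so check the guide for your version. It assumes `sc` is an existing `SparkContext`, and it reaches the Hadoop configuration through the private `_jsc` attribute.

```python
# Property name and committer class as named in the answer above
# (Spark 1.6 on EMR 4.3.0); verify both against your Spark version's docs.
COMMITTER_KEY = "spark.sql.parquet.output.committer.class"
DIRECT_COMMITTER = "org.apache.spark.sql.parquet.DirectParquetOutputCommitter"


def use_direct_parquet_committer(sc):
    """Set the Parquet output committer on a SparkContext's Hadoop configuration.

    `sc` is assumed to be a pyspark.SparkContext. `sc._jsc` is its underlying
    JavaSparkContext (a private attribute), whose hadoopConfiguration() returns
    the Hadoop Configuration object used by subsequent writes.
    Returns the value that is now stored under the committer key.
    """
    hadoop_conf = sc._jsc.hadoopConfiguration()
    hadoop_conf.set(COMMITTER_KEY, DIRECT_COMMITTER)
    return hadoop_conf.get(COMMITTER_KEY)
```

Call `use_direct_parquet_committer(sc)` once, before the DataFrame is written out as Parquet. If speculative execution is enabled on the cluster, read the note in the linked guide first, since it describes how that setting affects this committer.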