9/17/2023

Airflow docker requirements.txt

In this article, we will extract data from a RESTful API, transform the JSON response into a data frame, and load the data into an S3 bucket; then we will go a step further and schedule and run the ETL process on Airflow.

There is another blog post about that here, where we showed how to run Spark and Airflow directly on your machine. You can run Airflow and Spark locally, but we are not going to cover that in this post. In the GitHub repository, we have provided Dockerfiles for Spark and Airflow; these files allow you to extend and customize the Docker images for your own use cases. Note that Java must be installed on the Airflow instance before Airflow can run and schedule Spark jobs successfully.

One of the problems we encountered in this process was that Airflow could not find the Spark dependencies it needs to run and submit Spark jobs. In our case, we were scheduling Spark jobs with Airflow running in a Docker container. You can package your deployment, build a Spark image from it, and run your applications as containers. You can extend the Spark and Airflow Docker images through Dockerfiles, pinning the versions of Java, Spark, and Scala that your external systems require. We are not going to look at building the Airflow Docker images in this post, but there is comprehensive documentation on setting up Airflow on Docker here. For this post, we use the Spark image provided by Bitnami because it has more features and is easy to use. I have not been able to test and explore every available image personally, but I tested the Spark images provided by Datamechanics and Bitnami, and both work for the ETL workflow provided here.
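The extract-transform-load flow described above can be sketched in a few functions. This is a minimal illustration, not the repository's actual code: the API URL, bucket name, and object key below are placeholders you would replace with your own.

```python
import io

import pandas as pd
import requests

API_URL = "https://api.example.com/records"  # hypothetical endpoint
BUCKET = "my-etl-bucket"                     # hypothetical bucket name
KEY = "raw/records.csv"                      # hypothetical object key


def extract():
    """Fetch the raw JSON payload from the REST API."""
    resp = requests.get(API_URL, timeout=30)
    resp.raise_for_status()
    return resp.json()


def transform(records):
    """Flatten the (possibly nested) JSON records into a DataFrame."""
    return pd.json_normalize(records)


def load(df):
    """Serialize the DataFrame to CSV and upload it to S3."""
    # boto3 is imported here so extract/transform can run
    # without the AWS SDK installed.
    import boto3

    buf = io.StringIO()
    df.to_csv(buf, index=False)
    boto3.client("s3").put_object(Bucket=BUCKET, Key=KEY, Body=buf.getvalue())


if __name__ == "__main__":
    # Demonstrate the transform step on a local sample (no network call):
    sample = [{"id": 1, "meta": {"name": "a"}}]
    print(transform(sample))
```

In a real run you would chain the three steps, `load(transform(extract()))`, and that chain is exactly what we will later wrap in an Airflow task.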
In our case, we tried the images provided by Bitnami and Datamechanics; although these have the highest download counts on Docker Hub, there are still many other options. Various companies and individuals publish Spark images on Docker Hub, so you can pull them and try them out to see whether they address your development use case.