Leveraging Containers for Reproducible Data Science: Understand how containerisation can help ensure reproducibility in data science workflows and enable easier collaboration across teams.


Reproducibility is a crucial aspect of data science, enabling researchers to verify and build upon previous work. However, reproducing data science workflows can be complex and challenging, especially as data grows in volume and complexity. One way to address this challenge is by leveraging containers, which enable data scientists to package their work with all dependencies, ensuring that it can be run consistently across different systems. In this blog post, we’ll explore how containerisation can help ensure reproducibility in data science workflows and enable easier collaboration across teams.

What is Containerisation?

Containerisation is a technique used to package software applications and their dependencies into lightweight, portable containers. Containers are isolated from each other and the host system, enabling them to run consistently across different environments. Containerisation is often used in DevOps and software engineering to package and deploy applications, but it can also be used in data science workflows.

Fig 1: Basic Containerisation Architecture (Source: https://www.docker.com/resources/what-container/)

Why Use Containers for Data Science?

Using containers for data science has several benefits, including:

  • Reproducibility: Containers ensure that data science workflows can be run consistently across different environments, which is crucial for reproducibility.
  • Portability: Containers are portable and can be run on any system that supports the containerisation platform, which makes it easy to share and distribute data science work.
  • Isolation: Containers are isolated from each other and the host system, preventing conflicts between the dependencies of different projects.
  • Scalability: Containers can be easily scaled up or down to handle different workloads, which makes it easy to run data science work on large datasets or complex models.
  • Collaboration: Containers can be shared with other team members, which makes it easier to collaborate on data science work and enables knowledge sharing.

How to Use Containers for Data Science

To use containers for data science, we must create a container image containing all the dependencies and code required to run the workflow. We can create a container image using a containerisation platform like Docker, a popular tool for creating and managing containers.

To create a container image, we need to create a Dockerfile, a script specifying the dependencies and commands required to build the container. Here’s an example Dockerfile for a data science workflow that uses Python and the scikit-learn library:

# Base image
FROM python:3.8-slim-buster

# Set working directory
WORKDIR /app

# Copy requirements file
COPY requirements.txt .

# Install dependencies
RUN pip install -r requirements.txt

# Copy source code
COPY . .

# Set entry point
ENTRYPOINT [ "python", "my_script.py" ]

In this example, we’re using the Python 3.8 slim-buster image as the base image, installing the dependencies listed in requirements.txt (including scikit-learn), and setting the entry point to our Python script, my_script.py. We also copy the requirements file and the source code into the container.
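
The requirements.txt referenced in the Dockerfile lists the Python packages the workflow depends on; pinning exact versions is what makes a rebuilt image reproducible. A minimal example might look like this (the package versions here are illustrative placeholders, not prescriptions):

# requirements.txt — pin the versions your workflow actually uses
scikit-learn==1.2.2
pandas==1.5.3
numpy==1.24.2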

Once we’ve created the Dockerfile, we can build the container image using the docker build command:

docker build -t my_container .

This command will build a container image with the tag my_container. We can then run the container using the docker run command:

docker run my_container

This command will run the container and execute the entry point command, which, in our case, is python my_script.py. We can also pass arguments to the entry point command by specifying them after the container name:

docker run my_container arg1 arg2
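
The entry point script itself can be anything. As an illustrative sketch only (the dataset, model, and argument names below are assumptions, not from any particular workflow), here is a minimal my_script.py that reads hyperparameters from the command line, so the docker run arguments above actually reach the code:

import sys

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def main():
    # Hyperparameters passed via `docker run my_container arg1 arg2`;
    # fall back to defaults when no arguments are supplied.
    n_estimators = int(sys.argv[1]) if len(sys.argv) > 1 else 100
    random_state = int(sys.argv[2]) if len(sys.argv) > 2 else 42

    # A bundled toy dataset keeps the example self-contained; a real
    # workflow would load data from a mounted volume or object store.
    X, y = load_iris(return_X_y=True)

    model = RandomForestClassifier(n_estimators=n_estimators,
                                   random_state=random_state)
    scores = cross_val_score(model, X, y, cv=5)
    print(f"Mean cross-validation accuracy: {scores.mean():.3f}")

if __name__ == "__main__":
    main()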

In addition to the benefits mentioned above, containerisation allows data scientists to easily switch between different environments, such as testing and production, without worrying about compatibility issues. Furthermore, with containerisation, we can quickly create separate containers for each project, each with its own dependencies and configuration, as the short example below shows.
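
For instance (the tag names here are arbitrary), image tags are a simple way to keep per-environment builds separate:

docker build -t my_container:test .
docker build -t my_container:prod .
docker run my_container:test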

Another advantage of containerisation is that it can help simplify the process of deploying data science workflows to the cloud. Packaging our code and dependencies into a container allows us to quickly deploy it to a cloud platform like Amazon Web Services or Microsoft Azure, which can be scaled up or down as needed.

One common use case for containerisation in data science is building machine learning models. By packaging our code and dependencies into a container, we can ensure that the model is reproducible and can be run consistently across different environments. We can also easily share the container with other team members or deploy it to the cloud.

Packaging and Running a Docker Application in the Cloud

Now that we have a container image for our data science workflow, we can deploy it in the cloud. Several cloud providers support Docker, including Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP). In this example, we’ll deploy our container on AWS using Elastic Container Service (ECS).

To deploy our container on ECS, we need to create a task definition, which specifies the container image to use, the resources required by the container, and the networking configuration. Here are some additional details on creating a task definition in AWS:

  1. Navigate to the ECS console in your AWS account.
  2. From the navigation pane, click on “Task Definitions”.
  3. Click the “Create new Task Definition” button.
  4. Select the launch type you want to use (EC2 or Fargate).
  5. Select a task execution role if you have one already created, or create a new one.
  6. Choose the task size (CPU and memory) that your application needs.
  7. Add one or more container definitions. Each container definition specifies an image to use and various container settings, such as the port mappings and environment variables.
  8. Configure the network mode, which determines how the containers communicate with each other and with external resources.
  9. Add volume definitions to mount shared or external storage into the container.
  10. Configure the task placement constraints and strategies, such as how the tasks should be distributed across availability zones.
  11. Define the task launch type, which determines how the tasks should be started and stopped, and configure any required settings.
  12. Review your task definition, make any necessary changes, and then save it.

Some important things to know when creating a task definition in AWS include:

  1. Container images: When creating a container definition, specify the Docker image you want to use. This image must be stored in a container registry that AWS can access, such as Amazon ECR or Docker Hub (see the example push commands after this list).
  2. Networking: When configuring the network mode, choose the appropriate mode based on how your containers will communicate. The “awsvpc” network mode gives each task its own network interface and IP address; it is required for Fargate and works well behind a load balancer.
  3. Task execution role: This role provides permissions to access resources your containers need, such as logs or data stored in Amazon S3. You can create a new role or choose an existing one during task definition creation.
  4. Task placement: You can configure task placement constraints and strategies to control where your tasks are placed in your cluster. For example, you can specify that tasks should be evenly distributed across availability zones to improve fault tolerance.
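
For example, pushing the image we built earlier to Amazon ECR involves authenticating Docker against the registry, tagging the image with the registry URI, and pushing it. The account ID, region, and repository name below are placeholders, and the commands assume the ECR repository already exists:

aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-west-2.amazonaws.com
docker tag my_container:latest 123456789012.dkr.ecr.us-west-2.amazonaws.com/my_container:latest
docker push 123456789012.dkr.ecr.us-west-2.amazonaws.com/my_container:latest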

By following these steps and understanding the essential considerations, you can create a task definition in AWS that meets the needs of your application and enables you to run containerised workloads efficiently.

Here’s an example task definition for our data science workflow:

{
  "family": "my_task_definition",
  "taskRoleArn": "arn:aws:iam::123456789012:role/my_task_role",
  "containerDefinitions": [
    {
      "name": "my_container",
      "image": "my_registry/my_container",
      "cpu": 256,
      "memory": 512,
      "essential": true,
      "portMappings": [
        {
          "containerPort": 8000,
          "hostPort": 8000
        }
      ]
    }
  ],
  "networkMode": "awsvpc",
  "executionRoleArn": "arn:aws:iam::123456789012:role/my_task_execution_role",
  "requiresCompatibilities": [
    "FARGATE"
  ],
  "cpu": "256",
  "memory": "512"
}

In this example, we specify the container image to use (my_registry/my_container), the CPU and memory resources the container requires, and the networking configuration. We also specify that the container should listen on port 8000, which we’ll use to access our data science workflow.
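
Assuming the JSON above is saved locally as task_definition.json (the file name is our choice), the task definition can also be registered with the AWS CLI rather than through the console:

aws ecs register-task-definition --cli-input-json file://task_definition.json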

Once we’ve created the task definition, we can create a service that runs our container on ECS. Here’s an example service definition:

{
  "serviceName": "my_service",
  "taskDefinition": "my_task_definition",
  "loadBalancers": [
    {
      "targetGroupArn": "arn:aws:elasticloadbalancing:us-west-2:123456789012:targetgroup/my_target_group/abcd1234",
      "containerName": "my_container",
      "containerPort": 8000
    }
  ],
  "desiredCount": 1,
  "launchType": "FARGATE",
  "networkConfiguration": {
    "awsvpcConfiguration": {
      "subnets": [
        "subnet-1234abcd",
        "subnet-5678efgh"
      ],
      "securityGroups": [
        "sg-1234abcd"
      ],
      "assignPublicIp": "ENABLED"
    }
  }
}

In this example, we specify the task definition, the number of tasks to run (1), and the networking configuration. We also specify that the container should be accessible through a load balancer, which we’ve configured to listen on port 8000.
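
As with the task definition, the service can be created from the CLI. This assumes the JSON above is saved as service.json and that an ECS cluster already exists (my_cluster is a placeholder name):

aws ecs create-service --cluster my_cluster --cli-input-json file://service.json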

Once we’ve created the service, ECS will automatically deploy our container and start running our data science workflow. We can access the workflow using the load balancer URL provided by ECS.
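
For example, if the workflow exposes an HTTP endpoint on port 8000, it can be reached through the load balancer’s DNS name (the hostname below is a placeholder for the one ELB assigns):

curl http://my-load-balancer-1234567890.us-west-2.elb.amazonaws.com:8000/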

Conclusion

Containerisation can ensure reproducibility in data science workflows and enable easier collaboration across teams. By packaging data science work in containers, we can ensure that it runs consistently across different systems, making it easier to reproduce and share work. Additionally, deploying containers in the cloud makes it easy to scale up and down to handle different workloads. With containerisation, data scientists can focus on their work without worrying about system dependencies or deployment issues.
