Analyzing large amounts of data has become an everyday problem. Oftentimes, the datasets are only available as CSV files, which raises the question: how do you import them into your Postgres database? The short answer: by using Postgres’ COPY command. Here’s the long answer:

Let’s imagine you have an Ecto.Schema called Location with the following definition:

schema "locations" do
  field :name, :string
  field :latitude, :float
  field :longitude, :float
end

The locations table stores location data as latitude and longitude coordinates together with a name. For example, this could be a street address with a house number and its geocoded lat+long position.
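
For reference, a matching migration could look roughly like this sketch (the module name is assumed; your actual migration might differ, for example by also adding timestamps()):

defmodule App.Repo.Migrations.CreateLocations do
  use Ecto.Migration

  def change do
    create table(:locations) do
      add :name, :string
      add :latitude, :float
      add :longitude, :float
    end
  end
end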

Now, let’s imagine you have a CSV file called locations.csv with 100,000 rows of location data and you want to import the data into your locations table. In the following, we will use Postgres’ COPY command for that. First, we will call it directly from a psql session and then we will wrap it in a simple Mix.Task. Let’s go!

🔗 Most efficient: Postgres’ COPY

The fastest way of importing your data is to use Postgres’ COPY command. You can use it in two ways: if your CSV file exists on the same file system as your Postgres database, you can ask Postgres to simply move the data into the database. However, this won’t work when you want to load the data into a hosted database, since the file only exists in your local file system. In that case, you can either use COPY’s little brother, the \copy command, or leverage a Postgrex connection to stream the file through your Postgres session. We’ll see how to do the latter with Elixir later on.

In all cases, Postgres’ COPY is mind-bogglingly fast, but it has some limitations:

inserted_at and updated_at timestamps won’t be auto-generated. If your table defines them, they need to exist in your CSV file; importing a file without these timestamps will fail because of the not-null constraints on the timestamp columns. If you need timestamps but can’t add them to the file, consider backfilling the timestamp fields after you’ve imported the data, for example with:

# timestamps() columns default to second precision, so drop the microseconds
now = NaiveDateTime.utc_now() |> NaiveDateTime.truncate(:second)
Repo.update_all(Location, set: [inserted_at: now, updated_at: now])

Importing complex associations is tricky. If your data has associations with other tables and you need to validate that, for example, a referenced row actually exists in the other table, you have to do so manually after the data is imported. Postgres’ COPY command works best for “dumb” data that doesn’t depend on other tables.
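
To illustrate what such a manual check could look like (purely hypothetical here, since our locations example has no associations): suppose each row also carried a region_id column referencing a regions table. After the import, you could count orphaned rows with a schemaless query along these lines:

import Ecto.Query

# Hypothetical: count imported locations whose region_id has no
# matching row in the (assumed) regions table.
orphan_count =
  from(l in "locations",
    left_join: r in "regions",
    on: l.region_id == r.id,
    where: is_nil(r.id),
    select: count(l.id)
  )
  |> App.Repo.one()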

Now, with these limitations in mind, let’s have a look at how you can use Postgres’ COPY command.

🔗 COPY-ing without Elixir

First, double-check the column order of your CSV file. The columns should appear in the same order in which you list them in the COPY command, which is easiest to keep in sync with the order in your Ecto.Schema. In our case, our CSV needs to order the data as name,latitude,longitude.
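
For our example, the first few lines of locations.csv could look something like this (made-up sample rows, plus a header row):

name,latitude,longitude
Main Street 1,52.520008,13.404954
Elm Street 13,48.135125,11.581981
Harbor Road 7,53.551086,9.993682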

Next, open a psql-session to your database. If you have Postgres running locally, use this command:

psql -h localhost -p 5432 -U postgres -d locations_dev

Make sure to replace the parameters above with your own. Pay special attention to the database name behind the -d flag: it should be the name of the database into which you want to import the CSV.

Next, let’s import the CSV file with the following command:

locations_dev=# COPY locations(name,latitude,longitude)
                FROM './relative-or-absolute-path-to-your-file/locations.csv'
                WITH CSV HEADER DELIMITER ',';
COPY 69638

This command took only a second to run and imported almost 70k rows of location data! Amazing!

When you use the COPY command, Postgres reads the file straight from the file system and bulk-inserts the rows, skipping most of the overhead of issuing individual INSERT statements, which is why it is so fast. If you are connected to a hosted database, the import might take a bit longer because the file contents first have to be transferred to the database server.

We used a few options in our command.

  • The CSV option tells Postgres that we’re importing a CSV file. You could also import text or binary data.
  • The HEADER option tells Postgres that our file has a header row, which it should skip. If your file doesn’t have a header row, remove this option.
  • The DELIMITER option tells Postgres which character separates the values in our CSV file. Usually this is a comma (,), but sometimes it is a semicolon (;) or something even fancier. If your file is comma-separated, you can remove this option, since the comma is the default.

Here are a few troubleshooting tips in case the command didn’t work for you:

  1. The file can’t be found. You are probably running Postgres in a Docker container or are connected to a hosted Postgres instance. In that case, your file doesn’t exist in the same file system as your database. Either copy the file onto the Postgres instance first, or replace COPY with \copy and psql will stream your local file to the server for you.
  2. My file doesn’t have headers. Easy. Simply remove the HEADER option from the command.
  3. My file uses ‘;’ instead of ‘,’. Also easy. Simply replace ',' with ';' after DELIMITER in the command.

🔗 COPY-ing with Elixir

As you saw above, you can execute the COPY command directly in a psql session against your database. However, let’s see how we can build an Elixir wrapper around this command instead. That way, you can reuse the already configured database settings of your Phoenix application’s Ecto.Repo.

Let’s create a file in app/lib/mix/tasks/import_data.ex and fill it with a Mix.Task like this:

defmodule Mix.Tasks.ImportData do
  use Mix.Task

  @shortdoc "Imports a CSV file into the locations table"

  # `args` is the list of command-line arguments, so we match on a single path.
  def run([path]) do
    # Start the application so that the Repo configuration is loaded.
    Mix.Task.run("app.start", [])

    # Open a dedicated Postgrex connection using the Repo's configuration.
    opts = App.Repo.config()
    {:ok, pid} = Postgrex.start_link(opts)

    Postgrex.transaction(
      pid,
      fn conn ->
        # COPY ... FROM STDIN turns the query into a collectable stream.
        stream =
          Postgrex.stream(
            conn,
            "COPY locations(name,latitude,longitude) FROM STDIN CSV HEADER DELIMITER ','",
            []
          )

        # Pipe the CSV file, line by line, into the COPY stream.
        Enum.into(File.stream!(path, [:trim_bom]), stream)
      end,
      timeout: :infinity
    )
  end
end

The Mix.Task receives the path to the CSV file as its only argument, starts your application, and opens a new connection to Postgres with Postgrex.start_link/1, reusing your App.Repo configuration. Inside a transaction, it turns the COPY ... FROM STDIN command into a collectable stream and pipes the CSV file into it using File.stream!/2 and Enum.into/2. We set the timeout to :infinity because the transaction would otherwise time out if the import takes longer than 15 seconds, which is Postgrex’s default timeout. When creating the File.stream!/2 we pass the :trim_bom option. It instructs the file reader to strip a leading byte order mark, a few magic bytes that carry metadata about the file’s encoding and byte order, which we don’t want to end up in our database. That’s why we remove them while streaming the data.

Now, we can almost import the data. As a last step, make sure that your Ecto.Repo connection is properly configured in your application. This can look like the following:

# config/dev.exs

config :app, App.Repo,
  username: "postgres",
  password: "postgres",
  hostname: "localhost",
  database: "locations_dev",
  pool_size: 10

Or, if you want to connect using a URL instead:

# config/runtime.exs

config :app, App.Repo,
  url: System.get_env("DATABASE_URL"),
  pool_size: 10

Once you’ve configured your Ecto.Repo connection, let’s import the data with:

mix import_data ./relative-or-absolute-path-to-your-file/locations.csv

And that’s it! Now, you can import large CSVs directly into your database using a Mix.Task in just a matter of seconds! Amazing! :)
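
If you want to double-check the result, a quick count from an iex -S mix session should match the number of rows in your CSV (assuming a schema module like App.Location backed by the locations table):

iex> App.Repo.aggregate(App.Location, :count)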
