Migrating Databases with Ecto 25 Nov 2022

On a new project I'm working to migrate data between Azure SQL (aka MSSQL) and PostgreSQL, and to migrate a business's software from Visual Basic and C# to Elixir. In the first few weeks of development I've discovered some features of Ecto that I was previously unaware of. Combining these features will allow us to ship a resilient, well-tested application while constraining the possibility of accidentally altering data in production.

defmodule Test.Integration.CrossRepoTransactions do
  use Test.DataCase, async: true

  test "sharing database connections" do
    {:ok, alice} =
        Test.Fixtures.azure_person(:alice)
        |> Azure.Person.changeset()
        |> Azure.WriteableRepo.insert!()

    assert {:ok, _} = Azure.ReadOnlyRepo.get(Azure.Person, alice.id)
  end
end

Caveats

I in no way wish to imply that Visual Basic or C# codebases cannot be resilient or well tested, nor that PostgreSQL and Elixir are inherently better than Azure SQL and the dotnet ecosystem. The existing codebases for this company, however, do not have tests, and the team looking to extend the product functionality do not have the technical expertise to make them comfortable developing (or operating) the existing stack.

In this case, with this team, a re-write and a data migration made the most sense.

𐡷
Ivan Bandura @ Unsplash : https://unsplash.com/photos/bk3d-L7iVVE

Ecto and Azure SQL

Upon being first introduced the project, I did a quick search of hex.pm to find an Ecto adapter for mssql, finding several libraries that had not been updated in several years, with download counts implying minimal usage. Uh oh, I thought. This is going to be difficult.

Then my coworker mentioned that an adapter shipped with Ecto itself. Lo and behold, the ecto_sql package includes Ecto.Adapters.Tds, which can be used to connect to MSSQL Server or Azure SQL. This requires the addition of another library, tds.

defmodule MyApp.MixProject do
  use Mix.Project

  # ...

  defp deps do
    [
      # ... other dependencies``
      {:ecto, "~> 3.9"},
      {:ecto_sql, "~> 3.9"},
      {:tds, "~> 2.3"}
    ]
  end
end

Leaving our primary PostgreSQL repo alone, we'll add a second repo at lib/azure/repo.ex:

defmodule Azure.Repo do
  @moduledoc "An `Ecto.Repo` connecting to Azure SQL"
  use Ecto.Repo,
    adapter: Ecto.Adapters.Tds,
    otp_app: :my_app
end

This will require that we add some configuration to a few files. Note that we get the password from the environment; this will compile the password into the application, but since it's only in development and test I don't worry too much about it. For production, we do not set the password in config/prod.exs, but instead pull in a database URL in config/runtime.exs, to ensure that no secrets are compiled into our production application.

# config/config.exs
config :my_app, ecto_repos: [Core.Repo, Azure.Repo]
config :my_app, Azure.Repo, priv: "priv/azure_repo"

# config/dev.exs, config/test.exs
config :my_app, Azure.Repo,
  database: "my_app_#{config_env()}",
  hostname: "localhost",
  password: System.get_env("MSSQL_SA_PASSWORD"),
  pool_size: 10,
  port: 1433,
  show_sensitive_data_on_connection_error: true,
  stacktrace: true,
  username: "sa"

The :ecto_repos configuration determines how mix tasks will behave. mix ecto.gen.migration will create migrations in two directories, priv/repo/migrations and priv/azure_repo/migrations. mix ecto.migrate and mix ecto.rollback will run against both directories unless you specify -r Core.Repo. Depending on how you run production migrations, you might want to think about which repos are configured. We do not use (or even have) mix in production environments, as we ship Elixir releases; we hard-code specific repos in our release module.

We also have to start our new repo in lib/core/application.ex:

defmodule Core.Application do
  @moduledoc false
  use Application

  @impl true
  def start(_type, _args) do
    children =
      [
        Web.Telemetry,
        Core.Repo,
        Azure.Repo,
        {Phoenix.PubSub, name: Core.PubSub},
        {Finch, name: Core.Finch},
        Web.Endpoint
      ]

    opts = [strategy: :one_for_one, name: Core.Supervisor]
    Supervisor.start_link(children, opts)
  end

  # ...
end

But what about running the database itself during our development workflow? Happily, Microsoft provides a docker image that runs on both amd64 and aarch64, allowing us to run it on our M1 Macs as well as in GitHub Actions. The README for the docker image suggests that it includes a command line tool sqlcmd, but it only compiles on amd64, and thus is not available in the image version that we use on our Macs.

docker run --rm \
  --cap-add SYS_PTRACE \
  -e 'ACCEPT_EULA=1' \
  -e "MSSQL_SA_PASSWORD=${MSSQL_SA_PASSWORD}" \
  -p 1433:1433 \
  -v $PWD/priv/mssql/data:/var/opt/mssql/data \
  -v $PWD/priv/mssql/log:/var/opt/mssql/log \
  --name azuresqledge \
  --detach \
  "mcr.microsoft.com/azure-sql-edge"
name: Test & Audit
on:
  push:
    branches:
      - main
  pull_request:
    branches:
      - main
env:
  ELIXIR_VERSION: "1.14.2"
  OTP_VERSION: "25.1.2"
  MSSQL_SA_PASSWORD: "Super-secure-password"
jobs:
  test:
    name: Test
    needs: build_test
    runs-on: ubuntu-20.04
    env:
      MIX_ENV: test
    services:
      postgres:
        image: postgres:15.1-alpine
        env:
          POSTGRES_PASSWORD: postgres
          POSTGRES_USER: postgres
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
        ports:
          - 5432:5432
      mssql:
        image: mcr.microsoft.com/azure-sql-edge
        env:
          ACCEPT_EULA: 1
          MSSQL_SA_PASSWORD: ${{ env.MSSQL_SA_PASSWORD }}
        options: >-
          --cap-add SYS_PTRACE
        ports:
          - 1433:1433
    steps:
      - uses: actions/[email protected]
      - name: Set up Elixir
        uses: erlef/[email protected]
        with:
          elixir-version: ${{ env.ELIXIR_VERSION }}
          otp-version: ${{ env.OTP_VERSION }}
          version-type: "strict"
      - name: Cache deps
        uses: actions/[email protected]
        with:
          path: deps
          key: ${{ runner.os }}-test-deps-v1-${{ hashFiles('**/mix.lock', '.tool-versions') }}
      - name: Cache _build
        uses: actions/[email protected]
        with:
          path: _build
          key: ${{ runner.os }}-test-build-v1-${{ hashFiles('**/mix.lock', '.tool-versions') }}
      - name: Run tests
        run: mix test --color --include external:true
        working-directory: .

Outside of docker, we can install a new-ish (at the time of writing) Golang port of sqlcmd. It behaves slightly differently from the original version of the tool, but allows us to specify the password with $SQLCMDPASSWORD.

brew install sqlcmd
SQLCMDPASSWORD="${MSSQL_SA_PASSWORD}" sqlcmd -U sa

1> SELECT 1;
1> GO

-----------
          1

(1 row affected)

Because we only start this container for this project and we use the same environment variable everywhere we connect to the database, we commit the password into .envrc, which is loaded by the direnv tool that we use in development.

𐡷

Read-only Repo

<aside>

I prefer iterative development plans. Where possible, I especially like development plans where deadlines can be fuzzed and put into the hands of the key project stakeholders. Why say that we will go live on a specific date, when experience tells us that unforseen complications will invariably arise as a part of development? Data will turn out to be different than expected. External integrations will be more complicated and error prone than described in documentation (if documentation exists at all).

Instead of go-live deadlines, I prefer to commit to making things available to stakeholders as soon as possible. Starting a new project? Set up a staging and production deployment on day one. Can't set it up on day one? Try to deploy on day two. There's nothing in the application to see? Who cares. Deploy it. Set up continual deployment to the staging environment from CI, even if a manual step is used to deploy production.

Allow the stakeholders direct access to the new app or the new features, possibly through feature flags that can be enabled only for them. In the case of a new application, this can as simple as deploying it on a subdomain that is not linked anywhere, with authentication only available to specific users. This might make authentication flows higher priority to implement earlier.

When migrating to a new tech stack, this can allow you to run both applications... the people doing the work can use both workflows, and tell you when the new application is ready to replace the old one. Until then, they can use the old application for their primary work.

</aside>

Our new application will start by taking over only the very end of the business's workflows, consuming data from Azure SQL but leaving alone the processes that update data. After the key stakeholders decide that the new application is complete enough to use instead of their old software, we can decommission one of their old applications, and begin working up the chain of business processes.

This means that, at least in the near term, we would prefer that our connection to Azure SQL in production be read-only. In development and tests, we'll need to write data into our local database, but if we ever accidentally ship code into production that alters data in the old database, I would like an exception to be raised.

Ecto gives this to us with the read_only flag when defining the repo.

defmodule Azure.Repo do
  @moduledoc "A read-only `Ecto.Repo` connecting to Azure SQL"
  use Ecto.Repo,
    adapter: Ecto.Adapters.Tds,
    otp_app: :my_app,
    read_only: true
end

By default when a module uses Ecto.Repo, it defines not only read callbacks such as all, get, and one, but write callbacks such as delete, insert, and update. When specifying that a repo is read_only: true, only the callbacks that do not alter data are added. One cannot accidentally alter data via changesets, because the functions to do so are not present.

iex(1)> Azure.Person.changeset(%{}) |> Azure.Repo.insert()
** (UndefinedFunctionError) function Azure.Repo.insert/1 is undefined or private
    (my_app 0.1.0) Azure.Repo.insert(#Ecto.Changeset<action: nil, changes: %{}, errors: [], data: #Azure.Person<>, valid?: true>)
    iex:1: (file)

If someone really wanted to cause a problem, they could do so with Azure.Repo.query/1, so we take the additional precaution of only providing read-only credentials to the staging and production environments.

These days, I might ship something using query to quickly experiment with data where I can't figure out how to shape the query using Ecto.Query macros, but I prefer to eventually refactor towards using the macros, and I do not alter data using raw queries.

But because migrations are inserted into the schema_migrations table using the repo's insert function, we can no longer run migrations for our azure repo.

# config/config.exs
config :my_app, ecto_repos: [Core.Repo]
𐡷

Seeds and Development

Now that our application code is only able to read from the database, we run into a new problem: how can I write seed data into the database for development?

Here is where I really appreciate that Ecto allows you to define multiple repo modules.

defmodule Azure.WriteRepo do
  @moduledoc "A read-write `Ecto.Repo` connecting to Azure SQL. Only started in dev/test."
  use Ecto.Repo,
    adapter: Ecto.Adapters.Tds,
    otp_app: :my_app
end

We can bring back the ability to run migrations, optionally specifying the directory into which our migrations will reside. As we develop more parts of our data migration code, we'll dump the schemas of more and more tables into migrations, to ensure that our development and test databases run against a matching database structure.

# config/config.exs
config :my_app, ecto_repos: [Core.Repo, Azure.WriteRepo]
config :my_app, Azure.WriteRepo, priv: "priv/azure_repo"

# config/dev.exs, config/test.exs
for repo <- [Azure.Repo, Azure.WriteRepo] do
  config :my_app, repo,
    database: "my_app_#{config_env()}",
    hostname: "localhost",
    password: System.get_env("MSSQL_SA_PASSWORD"),
    pool_size: 10,
    port: 1433,
    show_sensitive_data_on_connection_error: true,
    stacktrace: true,
    username: "sa"
end

We can ensure that only the read-only repo is available in production by conditionally starting it in our application.

defmodule Core.Application do
  @moduledoc false
  use Application

  @impl true
  def start(_type, _args) do
    children =
      [
        Web.Telemetry,
        ecto_repos(),
        {Phoenix.PubSub, name: Core.PubSub},
        {Finch, name: Core.Finch},
        Web.Endpoint
      ]
      |> List.flatten()

    opts = [strategy: :one_for_one, name: Core.Supervisor]
    Supervisor.start_link(children, opts)
  end

  defp ecto_repos, do: Application.fetch_env!(:my_app, :started_ecto_repos)

  # ...
end

# config/dev.exs, config/test.exs
config :my_app, started_ecto_repos: [Core.Repo, Azure.Repo, Azure.WriteRepo]

# config/prod.exs
config :my_app, started_ecto_repos: [Core.Repo, Azure.Repo]

We still have the read-only database credentials as a final protection, but if someone were to ship code that tries to use the read/write repo in production, Ecto will raise with a clear error signaling that the repo has not been started.

𐡷

Async Tests and the SQL Sandbox

Now that we have two separate repos, one for read-only production access, and one for seeding data in development and tests, we have a new problem: how do we write async tests?

Ecto provides us with the Ecto.Adapters.SQL.Sandbox, which wraps each ExUnit test in a database transaction, providing data isolation for each test. At the beginning of each test, a transaction is opened. When queries are made from any process that has been allowed to access a specific connection checkout, the sandbox ensures that the process receives the same connection, and thus is in the same transaction.

# test/support/data_case.ex

def setup_sandbox(tags) do
  pid = Ecto.Adapters.SQL.Sandbox.start_owner!(Core.Repo, shared: not tags[:async])

  on_exit(fn ->
    Ecto.Adapters.SQL.Sandbox.stop_owner(pid)
  end)
end

Now that we have two repos, each with its own connection pool, and the production code will only use the read-only repo, we have a new problem. If we write into the test database via the writeable repo, with transaction isolation, the read-only repo will not be able to see any of the records.

Let's take a look at the internals of how Ecto.Adapters.SQL.Sandbox checks out a database connection.

def checkout(repo, opts \\ []) when is_atom(repo) or is_pid(repo) do
  %{pid: pool, opts: pool_opts} = lookup_meta!(repo)

  pool_opts =
    if Keyword.get(opts, :sandbox, true) do
      [
        post_checkout: &post_checkout(&1, &2, opts),
        pre_checkin: &pre_checkin(&1, &2, &3, opts)
      ] ++ pool_opts
    else
      pool_opts
    end

  pool_opts_overrides = Keyword.take(opts, [:ownership_timeout, :isolation_level])
  pool_opts = Keyword.merge(pool_opts, pool_opts_overrides)

  case DBConnection.Ownership.ownership_checkout(pool, pool_opts) do
    :ok ->
      if isolation = opts[:isolation] do
        set_transaction_isolation_level(repo, isolation)
      end

      :ok

    other ->
      other
  end
end

The post_checkout and pre_checkin callbacks call through to the connection modules, calling through to the database to open or cancel a transaction. In terms of PostgreSQL, this calls through to Postgrex to issue a BEGIN statement and a ROLLBACK statement. For Azure SQL, this uses Tds to TM_BEGIN_XACT and TM_ROLLBACK_XACT. Nice to know, but not helpful for our purposes.

DBConnection.Ownership.ownership_checkout/2 is interesting, and provides some interesting possibilities when one follows the code trail deeper into the library. This is where I went next, but gave up after about a half hour of fiddling with different settings.

At this point, in the interest of shipping something, I switched some tests with async: false, with some manual resetting of tables.

# test/support/data_case.ex

def setup_sandbox(tags) do
  pid = Ecto.Adapters.SQL.Sandbox.start_owner!(Core.Repo, shared: not tags[:async])

  on_exit(fn ->
    Ecto.Adapters.SQL.Sandbox.stop_owner(pid)
  end)

  if tags[:azure_repo] do
    ro_pid = Ecto.Adapters.SQL.Sandbox.start_owner!(Azure.Repo, shared: true)
    rw_pid = Ecto.Adapters.SQL.Sandbox.start_owner!(Azure.WriteRepo, shared: true)

    on_exit(fn ->
      ~w[my azure tables]
      |> Enum.each(fn table_name -> Azure.WriteRepo.query!("TRUNCATE #{table_name}") end)

      Ecto.Adapters.SQL.Sandbox.stop_owner(ro_pid)
      Ecto.Adapters.SQL.Sandbox.stop_owner(rw_pid)
    end)
  end
end

Not the happiest of outcomes, but acceptable in the short term to ship some code.

Enter Dynamic Repos

Given another look at the sandbox, I noticed that the pool itself is found via a private function lookup_meta!/1. What is that?

defp lookup_meta!(repo) do
  %{opts: opts} =
    meta =
    repo
    |> find_repo()
    |> Ecto.Adapter.lookup_meta()

  if opts[:pool] != DBConnection.Ownership do
    raise """
    cannot invoke sandbox operation with pool #{inspect(opts[:pool])}.
    To use the SQL Sandbox, configure your repository pool as:
        pool: #{inspect(__MODULE__)}
    """
  end

  meta
end

defp find_repo(repo) when is_atom(repo), do: repo.get_dynamic_repo()
defp find_repo(repo), do: repo

Finding the repo uses get_dynamic_repo/0, the documentation of which reads:

@callback get_dynamic_repo() :: atom() | pid()

Returns the atom name or pid of the current repository.

See put_dynamic_repo/1 for more information.

The documentation for put_dynamic_repo/1 includes the following:

Sometimes you may want a single Ecto repository to talk to many different database instances.

The included examples imply that this can be used for multi-tenant applications, where each tenant resides in a different database. I can see how this could also be used to shard data across multiple databases, or across N PostgreSQL schemas sharded across M database servers.

But if it can be uses to share a single repo module across multiple connection pools, could it be used to share a single connection pool across multiple ecto repos? If so, and if the SQL sandbox uses only the pool and the current pid in order to determine how to reuse connections in tests, then maybe we can use both repositories across a single connection.

# test/support/data_case.ex

def setup_sandbox(tags) do
  Azure.WriteRepo.put_dynamic_repo(Azure.Repo)
  pg_pid = Ecto.Adapters.SQL.Sandbox.start_owner!(Core.Repo, shared: not tags[:async])
  ro_pid = Ecto.Adapters.SQL.Sandbox.start_owner!(Azure.Repo, shared: not tags[:async])
  rw_pid = Ecto.Adapters.SQL.Sandbox.start_owner!(Azure.WriteRepo, shared: not tags[:async])

  on_exit(fn ->
    Ecto.Adapters.SQL.Sandbox.stop_owner(pg_pid)
    Ecto.Adapters.SQL.Sandbox.stop_owner(ro_pid)
    Ecto.Adapters.SQL.Sandbox.stop_owner(rw_pid)
  end)
end
defmodule Test.Integration.CrossRepoTransactions do
  use Test.DataCase, async: true

  test "sharing database connections" do
    {:ok, alice} =
        Test.Fixtures.azure_person(:alice)
        |> Azure.Person.changeset()
        |> Azure.WriteableRepo.insert!()

    assert {:ok, _} = Azure.Repo.get(Azure.Person, alice.id)
  end
end

Voilà! The tests pass.

Deadlocked tests

Note that in the documentation for the Ecto.Adapters.SQL.Sandbox database support, a disclaimer is made that, between MySQL and PostgreSQL, only the latter supports concurrent tests. The MySQL transaction implementation is more prone to deadlocks.

So far, we've found the similar case with Azure SQL, at least in our tests. While the above code allowed us to make most of our tests async, we had to keep some of our tests async: true to avoid deadlocks. In the future, hopefully we can discover the reason for the deadlocks and make all of our tests async. Even if we had to run all tests interacting with Azure SQL synchronously, however, this test setup is simpler and easier to understand.

𐡷

Postscript

In real code we don't name things Azure, but rather name entities based on their function or purpose. Azure.Repo is used in these examples to ensure that the source of the data is clear.

𐡷