Can Haskell's type system enforce correct ordering of data pipeline stages?

You can use a phantom type to store information about your dataset in its type, for example:

data Initial
data Cleaned
data Scaled

data Dataset a = Dataset { x :: Vector Double, y :: Vector Double, name :: String }

createDataset :: Vector Double -> Vector Double -> String -> Dataset Initial
createDataset x y name = Dataset x y name

removeOutliers :: Dataset Initial -> Dataset Cleaned
removeOutliers (Dataset x y n) =
    let (x', y') = clean x y
    in Dataset x' y' (n ++ "_clean")

With a few GHC extensions you can restrict the phantom type to a given state type and avoid declaring empty data types explicitly. For example:

{-# LANGUAGE DataKinds, KindSignatures #-}

data State = Initial | Cleaned | Scaled

data Dataset (a :: State) = Dataset { x :: Vector Double, y :: Vector Double, name :: String }

Yes, the modern Haskell type system can handle this. However, compared to usual, term-level programming, type-level programming in Haskell is still hard. The syntax and techniques are complicated, and documentation is somewhat lacking. It also tends to be the case that relatively small changes to requirements can lead to big changes in the implementation (i.e., adding a new "feature" to your implementation can cascade into a major reorganization of all the types), which can make it difficult to come up with a solution if you're still a little uncertain about what your requirements actually are.

@JonPurdy's comment and @AtnNn's answer give a couple of ideas of what's possible. Here's a solution that tries to address your specific requirements. However, it's likely to prove difficult to use (or at least difficult to adapt to your requirements) unless you're willing to sit down and teach yourself a fair bit of type-level programming.

Anyway, suppose you're interested in tagging a fixed data structure (i.e., always the same fields with the same types) with a type-level list of the processes that have been performed on it, with a means to check the process list against an ordered sublist of required processes.

We'll need some extensions:

{-# LANGUAGE ConstraintKinds #-}
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE PolyKinds #-}
{-# LANGUAGE TypeFamilies #-}
{-# LANGUAGE TypeOperators #-}
{-# LANGUAGE UndecidableInstances #-}

The process tags themselves are defined as constructors in a sum type, with the DataKinds extension lifting the tags from the term level to the type level:

data Process = Cleaned | Transformed | Scaled | Inspected | Analyzed

The data structure is then tagged with a list of applied processes, its "pipeline":

data Dataset (pipeline :: [Process])
  = Dataset { x :: [Double]
            , y :: [Double]
            , name :: String }

NOTE: It will be most convenient for the pipeline to be in reverse order, with the most recently applied Process first.

To allow us to require that a pipeline has a particular ordered subsequence of processes, we need a type-level function (i.e., a type family) that checks for subsequences. Here's one version:

type family a || b where
  True  || b = True
  False || b = b

type family Subseq xs ys where
  Subseq '[]      ys  = True
  Subseq nonempty '[] = False
  Subseq (x:xs) (x:ys) = Subseq xs ys || Subseq (x:xs) ys
  Subseq xs     (y:ys) = Subseq xs ys

We can test this type-level function in GHCi:

λ> :kind! Subseq '[Inspected, Transformed] '[Analyzed, Inspected, Transformed, Cleaned]
Subseq '[Inspected, Transformed] '[Analyzed, Inspected, Transformed, Cleaned] :: Bool
= 'True
λ> :kind! Subseq '[Inspected, Transformed] '[Analyzed, Transformed, Cleaned]
Subseq '[Inspected, Transformed] '[Analyzed, Transformed, Cleaned] :: Bool
= 'False
λ> :kind! Subseq '[Inspected, Transformed] '[Transformed, Inspected]
Subseq '[Inspected, Transformed] '[Transformed, Inspected] :: Bool
= 'False

If you want to write a function that requires a dataset to have been transformed and then cleaned of outliers (in that order), possibly intermixed with other, unimportant steps with the function itself applying a scaling step, then the signature will look like this:

-- remember: pipeline type is in reverse order
foo1 :: (Subseq [Cleaned, Transformed] pipeline ~ True)
     => Dataset pipeline -> Dataset (Scaled : pipeline)
foo1 = undefined

If you want to prevent double-scaling, you can introduce another type-level function:

type family Member x xs where
  Member x '[] = 'False
  Member x (x:xs) = 'True
  Member x (y:xs) = Member x xs

and add another constraint:

foo2 :: ( Subseq [Cleaned, Transformed] pipeline ~ True
        , Member Scaled pipeline ~ False)
     => Dataset pipeline -> Dataset (Scaled : pipeline)
foo2 = undefined

Then:

> foo2 (Dataset [] [] "x" :: Dataset '[Transformed])
... Couldn't match type ‘'False’ with ‘'True’ ...
> foo2 (Dataset [] [] "x" :: Dataset '[Cleaned, Scaled, Transformed])
... Couldn't match type ‘'False’ with ‘'True’ ...
> foo2 (Dataset [] [] "x" :: Dataset '[Cleaned, Transformed])
-- typechecks okay
foo2 (Dataset [] [] "x" :: Dataset '[Cleaned, Transformed])
  :: Dataset '[ 'Scaled, 'Cleaned, 'Transformed]

You can make it all a little friendlier, both in terms of constraint syntax and error messages, with some additional type aliases and type families:

import Data.Kind
import GHC.TypeLits

type Require procs pipeline = Require1 (Subseq procs pipeline) procs pipeline
type family Require1 b procs pipeline :: Constraint where
  Require1 True procs pipeline = ()
  Require1 False procs pipeline
    = TypeError (Text "The pipeline " :<>: ShowType pipeline :<>:
                 Text " lacks required processing " :<>: ShowType procs)
type Forbid proc pipeline = Forbid1 (Member proc pipeline) proc pipeline
type family Forbid1 b proc pipeline :: Constraint where
  Forbid1 False proc pipeline = ()
  Forbid1 True proc pipeline
    = TypeError (Text "The pipeline " :<>: ShowType pipeline :<>:
                 Text " must not include " :<>: ShowType proc)

foo3 :: (Require [Cleaned, Transformed] pipeline, Forbid Scaled pipeline)
     => Dataset pipeline -> Dataset (Scaled : pipeline)
foo3 = undefined

which gives:

> foo3 (Dataset [] [] "x" :: Dataset '[Transformed])
...The pipeline '[ 'Transformed] lacks required processing '[ 'Cleaned, 'Transformed]...
> foo3 (Dataset [] [] "x" :: Dataset '[Cleaned, Scaled, Transformed])
...The pipeline '[ 'Cleaned, 'Scaled, 'Transformed] must not include 'Scaled...
> foo3 (Dataset [] [] "x" :: Dataset '[Cleaned, Transformed])
-- typechecks okay
foo3 (Dataset [] [] "x" :: Dataset '[Cleaned, Transformed])
  :: Dataset '[ 'Scaled, 'Cleaned, 'Transformed]

A full code sample:

{-# LANGUAGE ConstraintKinds #-}
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE PolyKinds #-}
{-# LANGUAGE TypeFamilies #-}
{-# LANGUAGE TypeOperators #-}
{-# LANGUAGE UndecidableInstances #-}

import Data.Kind
import GHC.TypeLits

data Process = Cleaned | Transformed | Scaled | Inspected | Analyzed

data Dataset (pipeline :: [Process])
  = Dataset { x :: [Double]
            , y :: [Double]
            , name :: String }

type family a || b where
  True  || b = True
  False || b = b

type family Subseq xs ys where
  Subseq '[]      ys  = True
  Subseq nonempty '[] = False
  Subseq (x:xs) (x:ys) = Subseq xs ys || Subseq (x:xs) ys
  Subseq xs     (y:ys) = Subseq xs ys

type family Member x xs where
  Member x '[] = False
  Member x (x:xs) = True
  Member x (y:xs) = Member x xs

type Require procs pipeline = Require1 (Subseq procs pipeline) procs pipeline
type family Require1 b procs pipeline :: Constraint where
  Require1 True procs pipeline = ()
  Require1 False procs pipeline
    = TypeError (Text "The pipeline " :<>: ShowType pipeline :<>:
                 Text " lacks required processing " :<>: ShowType procs)
type Forbid proc pipeline = Forbid1 (Member proc pipeline) proc pipeline
type family Forbid1 b proc pipeline :: Constraint where
  Forbid1 False proc pipeline = ()
  Forbid1 True proc pipeline
    = TypeError (Text "The pipeline " :<>: ShowType pipeline :<>:
                 Text " must not include " :<>: ShowType proc)


foo1 :: (Subseq [Cleaned, Transformed] pipeline ~ True)
     => Dataset pipeline -> Dataset (Scaled : pipeline)
foo1 = undefined

foo2 :: ( Subseq [Cleaned, Transformed] pipeline ~ True
        , Member Scaled pipeline ~ False)
     => Dataset pipeline -> Dataset (Scaled : pipeline)
foo2 = undefined

foo3 :: (Require [Cleaned, Transformed] pipeline, Forbid Scaled pipeline)
     => Dataset pipeline -> Dataset (Scaled : pipeline)
foo3 = undefined