Introduction
NumPy is a well-known Python library that provides an N-dimensional array and high-performance operations on it. Its tagline is "The fundamental package for scientific computing with Python".
Pydantic is a data validation library for Python. The basic usage pattern is to define a custom datatype by declaring a class with type-hinted attributes. Through the magic metaclass machinery of Pydantic, you get robust and performant methods for converting to and from JSON or Python dictionaries.
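For example (a toy model; the class and values here are just for illustration):
from pydantic import BaseModel

class Point(BaseModel):
    x: float
    y: float
    label: str

# Validation: build an instance from JSON (or from a plain dict via model_validate)
p = Point.model_validate_json('{"x": 1, "y": 2.5, "label": "a"}')

# Serialization: dump it back out
print(p.model_dump())       # {'x': 1.0, 'y': 2.5, 'label': 'a'}
print(p.model_dump_json())  # {"x":1.0,"y":2.5,"label":"a"}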
This article shows how to make these two essential libraries work together.
The Issue
When you declare a new class inheriting from pydantic.BaseModel, Pydantic inspects the attributes you declare, and their type hints, to generate the procedures for loading and dumping instances of the class (called validation and serialization, respectively).
If Pydantic does not know how to do this with one of the types you are using, this step will fail, and you will get an error at module load time. For example, Pydantic knows how to deal with a list of floats, but not how to deal with a NumPy Array, so this code will work:
from pydantic import BaseModel

class WorkingModel(BaseModel):
    list_of_floats: list[float]
But this one will not:
import numpy as np
from pydantic import BaseModel

class BrokenModel(BaseModel):
    numpy_array: np.ndarray
The error message gives you a good lead to address the issue:
PydanticSchemaGenerationError: Unable to generate pydantic-core schema for
<class 'numpy.ndarray'>. Set `arbitrary_types_allowed=True` in the model_config
to ignore this error or implement `__get_pydantic_core_schema__` on your
type to fully support it.
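The quickest way out is the arbitrary_types_allowed escape hatch the message mentions, but then Pydantic only performs an isinstance check on the field: lists are not converted, and you are left to handle JSON serialisation and schema generation yourself. A sketch (the model name is just for illustration):
import numpy as np
from pydantic import BaseModel, ConfigDict

class EscapeHatchModel(BaseModel):
    model_config = ConfigDict(arbitrary_types_allowed=True)
    numpy_array: np.ndarray

EscapeHatchModel(numpy_array=np.array([1.0, 2.0]))  # accepted: it is an ndarray
# EscapeHatchModel(numpy_array=[1.0, 2.0])          # rejected: not an ndarray instance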
Other Tools
Before getting into the code I should point out a few libraries that allow Pydantic to interoperate with NumPy.
These are certainly much more complete solutions than the code I am presenting below, but the code may still be useful if you do not want to bring in another dependency or want a fairly simple example of making a custom type work with Pydantic.
The code
The following code allows you to use NumPy Arrays in your Pydantic models. It is limited to 1D arrays of floats.
- When validating JSON, a list of floats (numbers) can get converted to a NumPy Array
- In Python, it accepts lists of numbers, but also NumPy arrays, though it will raise an error if it encounters arrays with more than 1 dimension.
- The data will be converted to a list of numbers at serialization time
- In the JSON Schema, they are marked as lists of numbers/floats
- If an attribute is annotated as Numpy1Df64, the member in the validated object will literally be a NumPy array, not a wrapped or subclassed version!
from __future__ import annotations
from collections.abc import Iterable
from typing import Annotated, Any, TypeVar
import numpy as np
from pydantic import GetCoreSchemaHandler, GetJsonSchemaHandler
from pydantic.json_schema import JsonSchemaValue
from pydantic_core import CoreSchema, core_schema
T = TypeVar("T")
class _Numpy1Df64:
    @classmethod
    def __get_pydantic_core_schema__(
        cls, source_type: Any, handler: GetCoreSchemaHandler
    ) -> CoreSchema:
        def reject_non_1d_array(value: T) -> T:
            if isinstance(value, np.ndarray) and value.ndim != 1:
                msg = f"Array dimension must be 1 (got {value.ndim})"
                raise ValueError(msg)
            return value

        def from_ndarray(value: np.ndarray) -> np.ndarray:
            return value.astype(np.float64)

        from_ndarray_schema = core_schema.chain_schema(
            [
                core_schema.is_instance_schema(np.ndarray),
                core_schema.no_info_plain_validator_function(from_ndarray),
            ]
        )

        def from_float_list(value: Iterable[Any]) -> np.ndarray:
            return np.array(value, dtype=np.float64)

        from_float_list_schema = core_schema.chain_schema(
            [
                core_schema.list_schema(core_schema.float_schema()),
                core_schema.no_info_plain_validator_function(from_float_list),
            ]
        )

        return core_schema.json_or_python_schema(
            json_schema=from_float_list_schema,
            python_schema=core_schema.chain_schema(
                [
                    core_schema.no_info_plain_validator_function(reject_non_1d_array),
                    core_schema.union_schema(
                        [from_ndarray_schema, from_float_list_schema],
                        mode="left_to_right",
                    ),
                ]
            ),
            serialization=core_schema.plain_serializer_function_ser_schema(list),
        )

    @classmethod
    def __get_pydantic_json_schema__(
        cls, _core_schema: CoreSchema, handler: GetJsonSchemaHandler
    ) -> JsonSchemaValue:
        return handler(core_schema.list_schema(core_schema.float_schema()))
Numpy1Df64 = Annotated[np.ndarray[tuple[int], np.dtype[np.float64]], _Numpy1Df64]
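Here is a quick check of the type in action, continuing from the code above (the Signal model and sample values are just for illustration):
from pydantic import BaseModel

class Signal(BaseModel):
    samples: Numpy1Df64

# From JSON: a list of numbers becomes a real NumPy array
s = Signal.model_validate_json('{"samples": [1, 2.5, 3]}')
print(type(s.samples), s.samples.dtype)  # <class 'numpy.ndarray'> float64

# From Python: an existing 1D array is accepted too
s2 = Signal(samples=np.array([0.0, 1.0]))

# Serialization goes back to a list of floats
print(s.model_dump_json())  # {"samples":[1.0,2.5,3.0]}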
The Annotation
Let's start at the bottom.
Numpy1Df64 = Annotated[np.ndarray[tuple[int], np.dtype[np.float64]], _Numpy1Df64]
In Pydantic, we use Annotated to add information about a datatype. In this context, the identity of the datatype is np.ndarray[tuple[int], np.dtype[np.float64]].
- The type is np.ndarray, which is a NumPy Array.
- The first parameter in the square brackets is the shape parameter. Here, tuple[int] means the shape is a single integer, i.e. our array is 1D. Currently, the shape parameter is not particularly useful, as other NumPy functions tend to remove this information. This issue tracks shape support.
- The second parameter is np.dtype[np.float64], which means the datatype of the elements is a 64-bit float, sometimes called a double.
The second parameter to Annotated is the class that Pydantic uses to parse any attribute annotated with the Numpy1Df64 type.
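To a type checker, an attribute annotated with Numpy1Df64 is simply an np.ndarray; the extra metadata is only visible to tools that look for it. You can see the two parts with typing.get_args:
from typing import get_args

array_type, marker = get_args(Numpy1Df64)
print(array_type)  # numpy.ndarray[tuple[int], numpy.dtype[numpy.float64]]
print(marker)      # <class '_Numpy1Df64'>, where Pydantic finds the schema hooks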
The Core Schema
In our _Numpy1Df64 we have 2 methods which tell Pydantic what to do. The most important one is the __get_pydantic_core_schema__ function, which tells Pydantic how to validate (parse) and serialise (dump) the type. The other is __get_pydantic_json_schema__, which tells Pydantic how to generate entries in a JSON schema file. The official guide can be found in the Pydantic docs.
Let's start with the more difficult one, __get_pydantic_core_schema__, and look at what it returns:
return core_schema.json_or_python_schema(
    json_schema=from_float_list_schema,
    python_schema=core_schema.chain_schema(
        [
            core_schema.no_info_plain_validator_function(reject_non_1d_array),
            core_schema.union_schema(
                [from_ndarray_schema, from_float_list_schema],
                mode="left_to_right",
            ),
        ]
    ),
    serialization=core_schema.plain_serializer_function_ser_schema(list),
)
The json_schema argument tells Pydantic which validator to use on incoming JSON data. With JSON, we only need to consider lists of numbers, so we just use the schema defined above for those.
This schema, from_float_list_schema, has two stages, which are run sequentially via chain_schema.
from_float_list_schema = core_schema.chain_schema(
    [
        core_schema.list_schema(core_schema.float_schema()),
        core_schema.no_info_plain_validator_function(from_float_list),
    ]
)
The first is just a standard schema for reading lists of floats, and the second uses the custom function from_float_list to convert that list into a NumPy array. This shows how we can combine Pydantic's ready-made validation components with custom validation code.
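A nice side effect is that the standard list-of-floats stage gives us Pydantic's usual error reporting. Reusing the hypothetical Signal model from earlier:
from pydantic import ValidationError

try:
    Signal.model_validate_json('{"samples": [1.0, "oops", 3.0]}')
except ValidationError as err:
    print(err)  # reports that element 1 is not a valid number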
With Python input it is a bit more complicated, because we also want to accept objects that are already NumPy arrays, but only if they are 1D.
If we want to accept multiple input types, the strategy is to use a union_schema, which goes through the list of schemas you specify and uses the first one that succeeds without an error. We start with a schema that only works on NumPy arrays, followed by the list-of-floats schema mentioned before.
core_schema.union_schema(
    [from_ndarray_schema, from_float_list_schema],
    mode="left_to_right",
),
The from_ndarray_schema is also pretty straightforward:
def from_ndarray(value: np.ndarray) -> np.ndarray:
    return value.astype(np.float64)

from_ndarray_schema = core_schema.chain_schema(
    [
        core_schema.is_instance_schema(np.ndarray),
        core_schema.no_info_plain_validator_function(from_ndarray),
    ]
)
We first use the built-in is_instance_schema to check that the object is a NumPy Array, then we run the custom from_ndarray function, which just ensures the element datatype is a 64-bit float.
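For example, an array that is already float64 passes straight through, while a float32 array gets upcast (again using the hypothetical Signal model):
arr32 = np.array([1.0, 2.0], dtype=np.float32)
s = Signal(samples=arr32)
print(s.samples.dtype)  # float64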
The final piece of the puzzle is the reject_non_1d_array function, which raises an error if it encounters a non-1D NumPy array. This check is placed as the first element of the chain used to validate Python values.
def reject_non_1d_array(value: T) -> T:
    if isinstance(value, np.ndarray) and value.ndim != 1:
        msg = f"Array dimension must be 1 (got {value.ndim})"
        raise ValueError(msg)
    return value
You may notice some repetition here: we check for NumPy arrays in the reject_non_1d_array step, but also as part of from_ndarray_schema. So why don't we just check the dimension in from_ndarray_schema?
The reason is that from_ndarray_schema is part of a union schema. This means that if it raises an error while parsing a NumPy array, Pydantic moves on to trying from_float_list_schema. It turns out that from_float_list_schema will accept non-1D arrays and simply flattens them in the core_schema.list_schema(core_schema.float_schema()) step. So we need to reject non-1D arrays up front, before either branch of the union runs.
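With the rejection step in place, a non-1D array fails validation cleanly instead of being silently flattened (hypothetical example):
from pydantic import ValidationError

try:
    Signal(samples=np.zeros((2, 3)))
except ValidationError as err:
    print(err)  # Value error, Array dimension must be 1 (got 2)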
Phew! The last argument when constructing our schema object is serialization. We can simply use the list constructor to convert our NumPy arrays to lists of floats, which Pydantic knows how to serialize.
serialization=core_schema.plain_serializer_function_ser_schema(list),
The JSON Schema
The other method is __get_pydantic_json_schema__.
@classmethod
def __get_pydantic_json_schema__(
    cls, _core_schema: CoreSchema, handler: GetJsonSchemaHandler
) -> JsonSchemaValue:
    return handler(core_schema.list_schema(core_schema.float_schema()))
Remember that in JSON world, our NumPy array is a list of floats, so that is what should be put in any entries of a JSON Schema file.
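For the hypothetical Signal model from earlier, the generated schema looks roughly like this (output reformatted):
print(Signal.model_json_schema())
# {
#     "properties": {
#         "samples": {"items": {"type": "number"}, "title": "Samples", "type": "array"}
#     },
#     "required": ["samples"],
#     "title": "Signal",
#     "type": "object",
# }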
Alternatives
The purpose of this code was to be able to include 1D arrays of 64-bit floats in Pydantic models and take advantage of Pydantic's powerful validation and serialisation capabilities.
class MyModel(BaseModel):
    array: Numpy1Df64
The disadvantage of this method is the need to define the Numpy1Df64 type, along with the _Numpy1Df64 class and its custom validation code.
Let's discuss some other alternatives to achieve something similar.
Store as a List
The first way is to declare your array as a plain list in the model. The data will be stored as a list, but if you need a NumPy array, you will have to convert it.
class MyModel(BaseModel):
    array: list[float]

def f(data: MyModel):
    array_np = np.array(data.array)
    ...
NumPy Array Property
Another idea is to store the array as a list, but include a property to access the data as an array.
class MyModel(BaseModel):
    array: list[float]

    @property
    def array_np(self) -> np.ndarray[tuple[int], np.dtype[np.float64]]:
        return np.array(self.array)

data: MyModel
# NumPy array
data.array_np
This is a bit more convenient for users of the class, because they don't have to remember to convert the array. By using a property, you also keep the NumPy array in sync with the original list. On the other hand, the array is regenerated on every access, which could have performance drawbacks.
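If the regeneration cost matters and you treat the model as effectively immutable after validation, one option is to cache the property. This is only a sketch: it assumes your Pydantic version supports functools.cached_property on models, and the cache will go stale if you mutate array afterwards.
from functools import cached_property

class MyModel(BaseModel):
    array: list[float]

    @cached_property
    def array_np(self) -> np.ndarray[tuple[int], np.dtype[np.float64]]:
        # Computed once per instance, then reused
        return np.array(self.array)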
Field Validators
Another way is to use a Field Validator. The strategy is to add both array and array_np members and use a field validator to fill in the value of array_np.
from pydantic import BaseModel, ConfigDict, Field, ValidationInfo, field_validator
from pydantic.json_schema import SkipJsonSchema

class MyModel(BaseModel):
    model_config = ConfigDict(arbitrary_types_allowed=True)

    array: list[float]
    array_np: SkipJsonSchema[np.ndarray[tuple[int], np.dtype[np.float64]]] = Field(
        default_factory=lambda: np.zeros((0,), dtype=np.float64),
        validate_default=True,
        exclude=True,
    )

    @field_validator("array_np", mode="after")
    @classmethod
    def _fill_array(
        cls, value: np.ndarray[tuple[int], np.dtype[np.float64]], info: ValidationInfo
    ) -> np.ndarray[tuple[int], np.dtype[np.float64]]:
        return np.array(info.data["array"])
- We need to add arbitrary_types_allowed to the model config, so that the NumPy array type is allowed.
- Include a default factory for array_np so that we don't need to specify it when constructing the model.
- Use exclude to prevent array_np from being serialised.
- Use validate_default to ensure that the field validator runs when array_np is not given in the input data.
- The type of array_np needs to be wrapped in SkipJsonSchema so that Pydantic doesn't try (and fail) to include it in the JSON schema.
- Use a field validator which takes the value of array and converts it to a NumPy array. The value of array is only available because it is declared before array_np in the model.
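Here is roughly how this version behaves (hypothetical values):
m = MyModel.model_validate({"array": [1.0, 2.0, 3.0]})
print(type(m.array_np))  # <class 'numpy.ndarray'>
print(m.model_dump())    # {'array': [1.0, 2.0, 3.0]} - array_np is excluded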
This method is OK, but it relies on a lot of distracting extra configuration. It also stores the data twice, whereas the solution in this article only stores it once. Overall, like the other alternatives, it lacks the expressiveness of simply declaring the attribute to be a NumPy array using Numpy1Df64.
Conclusion
In this post, I gave an example of how to write some custom validation code to enable the excellent Pydantic data validation library to interact with arrays from the equally excellent NumPy library.