Data Analysis via Interpolation¶
Overview, Objectives, and Key Terms¶
In this lecture, we continue the theme of data analysis initiated in Lecture 24, focusing now on the process of interpolation. Whereas the fitted models of Lecture 24 allow one to produce simple models from noisy (or otherwise complicated) data, interpolation allows one to estimate new values from known discrete points that are assumed to be noise free.
Objectives¶
By the end of this lesson, you should be able to
- Apply linear interpolation to evaluate \(f(x)\) given discrete points \((x_i, f(x_i))\) for \(i = 0, 1 \ldots\).
- Apply linear interpolation to evaluate \(f(x, y)\) given discrete points \((x_i, y_j, f(x_i, y_j))\) for \(\, i = 0, 1, \ldots\) and \(j = 0, 1,\ldots\).
- Apply tools from NumPy and SciPy to perform high-order (e.g., cubine-spline) interpolation
Key Terms¶
- linear interpolation
np.interp
scipy.interpolate
interp1d
interp2d
Simple Linear Interpolation¶
A common task in many thermal engineering courses is to use tabulated physical properties. Often, the value needed is not in the table, and linear interpolation is required to estimate the missing value.
Let the quantity of interest be \(y(x)\), for which values \(y_0, y_1, \ldots\) are provided at \(x_0, x_1, \ldots\). Suppose the value of \(y\) at \(x = \tilde{x}\) is desired. The first step is to determine which two values \(x_i\) and \(x_{i+1}\) contain the point \(\tilde{x}\). This step is accomplished using the linear or binary search algorithms developed in Lecture 14.
Exercise: Apply linear search to find \(i\) such that \(\tilde{x} \in [x_i, x_{i+1}]\) for \(\tilde{x} = 3.5\) and \(x = [0, 1, 2, 3, 4, 5]\).
Exercise: Repeat the exercise using binary search.
Exercise: For the same \(\tilde{x}\) and points \(x\),
evaluate np.array(x) < x_tilde
and sum(np.array(x) < x_tilde)
.
What can you conclude?
Exercise: Use help
to learn about np.searchsorted
. Then, for
the same \(\tilde{x}\) and points \(x\), evaluate
np.searchsorted(x, x_tilde)
. What can you conclude?
Once \(i\) is determined, \((x_i, y(x_i))\) and \((x_{i+1}, y(x_{i+1}))\) are connected by a line of the form \(ax + b\), and \(y(\tilde{x})\) is evaluated from \(a\tilde{x} + b\). The trick is to determine \(a\) and \(b\), but that’s easy: we have two equations \(y_i = ax_i + b\) and \(y_{i+1} = a x_{i+1} + b\). Just solve for \(a\) and \(b\):
Exercise: Consider the following table of thermodynamic data for saturated steam (saved as thermo_data.txt):
T [K] P [MPa] h_l [kJ/kg] h_v [kJ/kg]
373.150 0.101 2675.572 419.099
389.817 0.179 2701.054 489.625
406.483 0.298 2724.633 560.634
423.150 0.476 2745.919 632.252
439.817 0.730 2764.528 704.626
456.483 1.082 2780.059 777.934
473.150 1.555 2792.062 852.393
489.817 2.175 2800.004 928.269
506.483 2.972 2803.247 1005.893
523.150 3.976 2801.012 1085.687
539.817 5.222 2792.314 1168.202
556.483 6.745 2775.820 1254.194
573.150 8.588 2749.574 1344.771
Evaluate \(h_l\) at 515 K using linear interpolation.
Solution
In [1]:
# first load the data
import numpy as np
T, P, h_l, h_v = np.loadtxt('thermo_data.txt', unpack=True, skiprows=1)
# find i such that T[i-1] <= 500 <= T[i]
T_tilde = 515
for i in range(len(T)):
if T[i] > T_tilde:
break
# compute slope and intercept
a = (h_l[i]-h_l[i-1])/(T[i]-T[i-1])
b = h_l[i-1] - a*T[i-1]
# compute approximation
h_l_tilde = a*T_tilde + b
h_l_tilde
Out[1]:
2802.1048931421374
Exercise: Adapt the solution above to produce a function that returns \(h_l\) at any temperature T. However, add assertions to ensure the user does not request \(h_l\) at temperatures outside the range of the data. In other words, limit the function to interpolation and exclude any extrapolation.
Exercise: Suppose \(\tilde{x}\) is beyond \(x_{n-1}\), the last of the \(n\) values of \(x\) in the tabulated data. How would one linearly extrapolate the data to obtain \(y(\tilde{x})\)?
Exercise: Execute help(np.interp)
. Use it to interpolate
\(x = [1, 2, 3]\) and \(y = [2, 8, 18]\) for
\(\tilde{x} = 1.5\) and \(\tilde{x} = 2.5\).
Interpolation In Two Dimensions¶
Some applications require interpolation of two-dimensional data evaluated at \((x_i, y_j)\) for \(i=0,1,\ldots\) and \(j=0,1,\ldots\). Linear interpolation of such data is based on the model \(f(x, y) \approx ax + by + c\) for \(x \in [x_i, x_{i+1}]\) and \(y \in [y_j, y_{j+1}]\).
The trick now is to find \(i\) and \(j\) for values \(\tilde{x}\) and \(\tilde{y}\) for which \(f(x, y)\) is desired. However, the same ideas applied for one dimension work for this two-dimensional problem. To do so requires an extra step:
- Again, first assume that \(i\) and \(j\) are found by an appropriate search.
- Then, find \(f(\tilde{x}, y_{j})\) and \(f(\tilde{x}, y_{j+1})\) via two separate interpolations.
- Finally, find \(f(\tilde{x}, \tilde{y})\) by linearly interpolating the two values \(f(\tilde{x}, y_{j})\) and \(f(\tilde{x}, y_{j+1})\).
Exercise: Given \(f(1, 1) = 6\), \(f(1, 2) = 9\), \(f(2, 1) = 8\), and \(f(2, 2) = 11\), estimate \(f(4/3, 7/4)\) by linear interpolation.
Solution:
In [2]:
# define the given data
x = [1, 2]
y = [1, 2]
f = np.array([[6, 9],
[8, 11]])
xtilde = 4/3
ytilde = 7/4
# interpolate to get f(xtilde, 1)
a = (f[1, 0]-f[0, 0])/(x[1] - x[0])
b = f[0, 0] - a*x[0]
f_xt_1 = a*xtilde + b
# interpolate to get f(xtilde, 2)
a = (f[1, 1]-f[0, 1])/(x[1] - x[0])
b = f[0, 1] - a*x[0]
f_xt_2 = a*xtilde + b
# interpolate to get f(xtilde, ytilde)
a = (f_xt_2 - f_xt_1)/(y[1] - y[0])
b = f_xt_1 - a*y[0]
f_xt_yt = a*ytilde + b
f_xt_yt
Out[2]:
8.9166666666666661
Exercise: It turns out the previous exercise produces a perfect value for \(f(4/3, 7/4)\) when \(f(x, y) = 2x + 3y + 1\). Can you recover \(a=2\), \(b=3\), and \(c=1\) from your interpolation process?
Exercise: Suppose \(f(1, 1) = 7\), \(f(1, 2) = 11\), \(f(2, 1)=10\), and \(f(2, 2)=15\). Compute \(f(4/3, 7/4)\).
Exercise: Consider the following table of water densities (\(\rho\)) for different temperatures [K] and pressures [MPa] (from thermodata_2.txt):
T [K] P [MPa] rho [kg/m^3]
293.150 2.000 999.073
293.150 4.000 999.982
293.150 6.000 1000.888
293.150 8.000 1001.791
313.150 2.000 993.053
313.150 4.000 993.924
313.150 6.000 994.790
313.150 8.000 995.653
333.150 2.000 984.037
333.150 4.000 984.903
333.150 6.000 985.766
333.150 8.000 986.625
353.150 2.000 972.650
353.150 4.000 973.538
353.150 6.000 974.422
353.150 8.000 975.301
373.150 2.000 959.242
373.150 4.000 960.172
373.150 6.000 961.097
373.150 8.000 962.017
Use linear interpolation to estimate \(\rho(300, 3)\), \(\rho(330, 5)\), and \(\rho(360, 7)\).
Higher-Order Interpolation¶
Linear interpolation is conceptually straightforward, but it helps to
use tools like np.interp
(see the exercises above). However,
interpolation that is better than linear is more challenging to
implement, and we’ll turn to the scipy.interpolate
module and its
interp1d
and interp2d
utilities.
The syntax for interp1d
is quite simple:
f = interp1d(x, y, kind)
, where x
and y
are the points to
interpolate, and kind
is a str
argument equal to 'linear'
(default), 'quadratic'
, 'cubic'
, among other possibilities.
Here, linear
leads to linear interpolation equivalent to the method
introduced above. Quadratic and cubic interpolation involve defined
quadratic curves over sets of three points or cubic functions over sets
of four points. Of course, if the underlying data were actually
quadratic or cubic, these higher-order interpolation schemes should be
expected to do better. The resulting f
is a callable function that
accepts \(\tilde{x}\), the point at which the data is to be
estimated.
Example: Interpolate \(x = [1, 2, 3, 4, 5]\) and \(y = [1, 4, 9, 16, 25]\) for \(\tilde{x} = 4.5\)
Solution:
In [3]:
from scipy.interpolate import interp1d
f_lin = interp1d([1,2,3,4,5], [1,4,9,16,25], kind='linear')
f_quad = interp1d([1,2,3,4,5], [1,4,9,16,25], kind='quadratic')
f_lin(4.5), f_quad(4.5), 4.5**2
Out[3]:
(array(20.5), array(20.25), 20.25)
The syntax for interp2d
is equally simple:
f = interp2d(x, y, f, kind)
. Here, x
and y
are individual
arrays, while f
is a 2-D array representing \(f(x_i, y_j)\) for
\(i = 0, 1, \ldots\) and \(j = 0, 1, \ldots\). Any easy way to
evaluate f
is to use np.meshgrid
. Finally, kind
is a str
argument that can be 'linear'
and'cubic'
(among others, but
not 'quadratic'
).
Exercise: Given x = np.linspace(0, 10, 11)
,
y = np.linspace(0, 10, 11)
, and
z = lambda x, y: 1 + x + y + x**2 + y**2 + x**4*y**4
(a decidely
nonlinear function), evaluate f(4.5, 6.5)
using linea and cubic
interpolation.
Solution
In [4]:
x = np.linspace(0, 10, 11)
y = np.linspace(0, 10, 11)
xx, yy = np.meshgrid(x, y)
z = lambda x, y: 1 + x + y + x**2 + y**2 + x**4*y**4
from scipy.interpolate import interp2d
f_lin = interp2d(x, y, z(xx, yy), kind='linear')
f_cub = interp2d(x, y, z(xx, yy), kind='cubic')
f_lin(4.5, 6.5), f_cub(4.5, 6.5), z(4.5, 6.5)
Out[4]:
(array([ 814339.25]), array([ 731923.8169771]), 732061.69140625)
Further Reading¶
In this lecture, interpolation has been limited to regular grids of points along the \(x\) and \(y\) axes. When points are more than one dimension are not on such grids, different methods are needed. Barycentric interpolation is one option. In 2-D, points can be organized as triangles, and linear interpolation can be applied within a single triangle. In 3-D, triangles become pyramids, and in higher dimensions, one interpolates over general n-simplices.