What is lowess?
LOWESS stands for LOcally WEighted Scatterplot Smoothing. It is a non-parametric regression method: no specific functional form is imposed, so the estimated curve does not have to follow a particular function. Lowess is quite powerful to “get a feel” for data without restricting yourself to any form. In plain terms, it is used to:
- fit a line to a scatter plot or time plot where noisy data values, sparse data points or weak interrelationships interfere with your ability to see a line of best fit.
- stand in for linear regression where least squares fitting doesn’t create a line of good fit or is too labor-intensive to use.
To illustrate what makes lowess graphs so useful, here is a small example, with all code accessible via the link below. First, I simulate a pandas DataFrame with two variables and 100 rows and draw a simple scatter plot, shown below.
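A minimal sketch of such a simulation could look like this. The exact data-generating process is not given here, so a noisy sine curve is assumed as a stand-in for "non-linear data", and the column names `x` and `y` are hypothetical.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Assumption: a noisy sine curve stands in for the non-linear data;
# the original data-generating process is not stated.
rng = np.random.default_rng(seed=42)
x = np.linspace(0, 10, 100)
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

# Hypothetical column names "x" and "y".
df = pd.DataFrame({"x": x, "y": y})

# Plain scatter plot of the two variables.
fig, ax = plt.subplots()
ax.scatter(df["x"], df["y"])
ax.set_xlabel("x")
ax.set_ylabel("y")
```

In a notebook the figure renders on its own; in a script, call `plt.show()` at the end.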
As we can see, there is a non-linear relationship in the data, making it tricky to estimate the trend. With linear data, we could simply plot a linear trend line before moving on to, for example, estimating a model. As an example, I used seaborn.regplot to add a simple trend line, shown below:
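For reference, a linear trend line with `seaborn.regplot` can be drawn roughly as follows. The simulated data (a noisy sine curve) and the column names are assumptions for illustration, not the data used for the figure.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Assumed stand-in data: noisy sine curve, hypothetical columns "x" and "y".
rng = np.random.default_rng(seed=42)
x = np.linspace(0, 10, 100)
df = pd.DataFrame({"x": x, "y": np.sin(x) + rng.normal(scale=0.3, size=x.size)})

# regplot draws the scatter points plus an ordinary least squares line.
# ci=None switches off the bootstrapped confidence band.
fig, ax = plt.subplots()
sns.regplot(data=df, x="x", y="y", ci=None, ax=ax)
```

On data like this, the straight line averages over the curve's ups and downs, which is exactly the problem described next.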
As we can see, the linear trend line does not really capture the relationship in the data. It looks even worse if we extend the x-axis:
Here, lowess comes into play, as it tries to fit a non-parametric line to the data. A detailed explanation of the mathematics behind lowess can be found here. Seaborn actually has a built-in option to plot lowess lines (the lowess argument of regplot), looking like this:
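The seaborn call in question is `regplot` with `lowess=True`, which needs statsmodels to be installed. A short sketch, again on assumed stand-in data with hypothetical column names:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Assumed stand-in data: noisy sine curve, hypothetical columns "x" and "y".
rng = np.random.default_rng(seed=42)
x = np.linspace(0, 10, 100)
df = pd.DataFrame({"x": x, "y": np.sin(x) + rng.normal(scale=0.3, size=x.size)})

# lowess=True replaces the linear fit with a lowess curve
# computed by statsmodels under the hood.
fig, ax = plt.subplots()
sns.regplot(data=df, x="x", y="y", lowess=True, ax=ax)
```

Note that `regplot` exposes only this on/off switch for lowess, so the smoothing itself cannot be tuned from the seaborn call.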
As we can see, the relationship in the data is shown pretty well! However, the estimation is built on the statsmodels package and gives you little freedom to change things and experiment. Seaborn is quite fast in estimating the lowess line, which comes at a cost of accuracy. To illustrate this, I used only 10 observations and plotted the lowess line again. We can see that it is not really “smooth” anymore.
To overcome this inaccuracy, I adapted the following code, making it a bit simpler and adapting it for dataframes. It not only lets you specify the sensitivity of the individual estimations (via alpha), but also the degree of the polynomial, which is pretty handy for more complex data.
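To make the roles of alpha and the polynomial degree concrete, here is a rough sketch of a local polynomial smoother for dataframes. It is an illustration under stated assumptions rather than the adapted code itself: the function name `lowess_fit`, the output column `fitted`, the nearest-neighbour window and the tricube weights are all choices made for this example, and the robustness iterations of classic LOWESS are left out.

```python
import numpy as np
import pandas as pd


def lowess_fit(df, x_col, y_col, alpha=0.7, degree=2):
    """Hypothetical helper: locally weighted polynomial fit at each x value.

    alpha  -- share of observations used in each local fit (0 < alpha <= 1)
    degree -- degree of the local polynomial
    """
    x = df[x_col].to_numpy(dtype=float)
    y = df[y_col].to_numpy(dtype=float)
    n = x.size
    # Window size: at least degree + 2 points, because the farthest
    # neighbour receives a tricube weight of zero.
    k = min(n, max(degree + 2, int(np.ceil(alpha * n))))

    fitted = np.empty(n)
    for i, x0 in enumerate(x):
        dist = np.abs(x - x0)
        idx = np.argsort(dist)[:k]          # k nearest neighbours of x0
        h = dist[idx].max()                 # local bandwidth
        u = dist[idx] / h if h > 0 else np.zeros(k)
        w = (1 - u**3) ** 3                 # tricube weights
        # np.polyfit multiplies residuals by w before squaring,
        # so sqrt(w) gives weighted least squares with weights w.
        coefs = np.polyfit(x[idx] - x0, y[idx], deg=degree, w=np.sqrt(w))
        fitted[i] = coefs[-1]               # x is centred at x0, so the intercept is the fit

    out = pd.DataFrame({x_col: x, "fitted": fitted})
    return out.sort_values(x_col, ignore_index=True)


# Usage on 10 simulated observations (assumed noisy sine data).
rng = np.random.default_rng(seed=1)
small = pd.DataFrame({"x": np.linspace(0, 10, 10)})
small["y"] = np.sin(small["x"]) + rng.normal(scale=0.2, size=10)
smooth = lowess_fit(small, "x", "y", alpha=0.7, degree=2)
```

A larger alpha pulls more points into each local fit and gives a smoother line, while a higher degree lets each local fit bend more. The returned frame can be plotted as a line on top of the scatter plot.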
Applying this function to the data at hand, we get a pretty smooth lowess line. For me, an alpha of 0.7 and a polynomial degree of 2 lead to the best results. The result is pretty impressive, as you can see the trend of the data with only 10 observations (of course, the underlying data do not include heavy outliers).
What is it all good for?
Lowess lines can help a lot when inspecting data. They are a natural extension of scatterplots and help you get a feel for the relationships in the data. Especially when data are messy, or when it is important to understand whether trends are similar, such as in difference-in-differences studies, lowess can help!
This is my very first article on Medium and I hope you like it. Please let me know if you have any ideas for improvement. If you want to find out more about data science, visualisation and Python, feel free to follow me.