edX Course Review: Data Science for AEC
Written by: Yiyu Chen, Sustainability Engineer (LinkedIn)
As a building energy modeling professional, I find myself constantly needing to deal with an abundance of data: energy simulation results, metered energy data, and so on. This type of information can easily consume hundreds of gigabytes of hard drive space. I’m not here to complain about the size of the data. I’m trying to discover additional opportunities to make the best use of these resources. If you are an analytical person like me, here is an edX course for you!
The course is called “Data Science for Construction, Architecture, and Engineering.” As the name suggests, the five-week online course aims to bring the beauty of data science into the AEC industry. Information is clearly becoming more and more important in our industry. The emergence of Building Information Modeling has almost revolutionized the design process. Smart meters are being deployed more often in new construction, and energy modeling software is producing detailed results on a sub-hourly basis. Is there a better time to learn some data science skills and apply them to these abundant resources?
The course is structured from basic to advanced so that everyone can follow along. Essentially, this is a programming class in which you learn basic coding skills and apply them to real-world collections of building-specific data as the course progresses. Here is what each week has to offer, along with some of the use cases that excite me.
Week 1.
This week, you will be introduced to Python, one of the (if not “the”) most popular programming languages for data analysis. The instructor introduces Python fundamentals, such as syntax, variables, control statements, simple arithmetic, functions, and so on. If you are already familiar with Python, this week is easy and will go fast for you; I had recently completed another online Intro to Python course, so this was familiar ground. It is crucial that you have a good grasp of Python fundamentals, since the rest of the course is built on top of this programming language. Even this basic Python programming skill is useful for me as an energy modeler.
Use Case: Develop scripts in energy modeling software. The software we use offers a Python API with which users can develop scripts to expedite the workflow or do even more powerful things. We have been writing Python scripts that retrieve model data (e.g. climate and site information, room internal loads) and automatically generate workbooks in “xlsx” format containing all the model information. With these scripts, we no longer need to “manually” put together all the tables that go into reports for our clients. It’s just one click away!
The Python basics are only briefly introduced in the first week. A more in-depth course is recommended to further strengthen your Python skills.
Week 2.
In the second week, a Python library called “pandas” is introduced. It is an open-source data analysis tool built on top of the Python programming language.
This week is where hands-on data analysis takes place. You will be learning pandas fundamentals, such as how to read a CSV file from a directory, create a data frame, read a specific column or row, combine data frames, and so on. Since this week is focused on design-phase energy simulation data, all the exercises are performed on real energy modeling results. You are asked to compare simulation data between different energy efficiency measures, describe the differences, and draw conclusions.
Use Case: Analyzing energy simulation data. You might be thinking: Oh, I can do all that in Excel. Admittedly, that is how we typically analyze simulation results. The beauty of using pandas is that it is much more efficient at parsing larger data sets. I remember how inconvenient it was to look at Title 24 (California energy compliance) hourly simulation results stored in a CSV file. Pandas reads CSV files. Wouldn’t it be easier to pass the files into pandas and use a script that you have already developed to analyze those 8,760 hours’ worth of data? (Hint: the answer is “yes”!)
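To give a flavor of that workflow, here is a minimal sketch of comparing a baseline run against one efficiency measure in pandas. The column names and numbers are made up for illustration (they are not from the course or from a real Title 24 export), and the inline text stands in for CSV files on disk.

```python
import io

import pandas as pd

# Hypothetical hourly results for a baseline and one efficiency measure.
# In practice, pd.read_csv would be pointed at the exported CSV files.
baseline_csv = io.StringIO(
    "hour,cooling_kwh,heating_kwh\n"
    "1,10.0,2.0\n"
    "2,12.0,1.5\n"
    "3,15.0,1.0\n"
)
measure_csv = io.StringIO(
    "hour,cooling_kwh,heating_kwh\n"
    "1,8.0,2.0\n"
    "2,9.5,1.5\n"
    "3,12.0,1.0\n"
)

baseline = pd.read_csv(baseline_csv, index_col="hour")
measure = pd.read_csv(measure_csv, index_col="hour")

# Combine the two runs side by side, labeled by scenario name.
combined = pd.concat({"baseline": baseline, "measure": measure}, axis=1)

# Total energy per scenario and end use.
totals = combined.sum()

# Percent savings of the measure relative to the baseline, per end use.
savings_pct = (1 - measure.sum() / baseline.sum()) * 100

print(totals)
print(savings_pct.round(1))
```

The same few lines work unchanged whether the files hold three rows or a full year of hourly results, which is where the script starts to pay off compared with a spreadsheet.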
Week 3.
More pandas! This week’s focus is the real-world metered energy data gathered during building operations. By now the course has progressed to more advanced pandas concepts. This week, you will be learning how to deal with time-series data. Specifically, you will be learning how to manipulate and clean up a data frame using functions such as resample, truncate, and fillna.
Use Case: Clean up, re-organize, and analyze energy trend data. We put that skill to use immediately. We have been working on projects in which we need to gather metered energy data, analyze it, and produce reports. Inevitably, the data sets are not always perfect. They contain missing data (sometimes spanning a few days) and outliers. It is necessary to remove all the noise before the data can be evaluated and provide meaningful insights. What we used to do with macro-powered spreadsheets, we’re now able to do with a single line of code in a cloud-based Jupyter notebook.
The content covered this week, in my opinion, was the most useful so far.
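For a sense of what that cleanup looks like, here is a minimal sketch using the three pandas functions named above. The meter readings are synthetic, and both the outlier threshold and the choice to fill gaps with the median are assumptions made for illustration, not the course's method.

```python
import numpy as np
import pandas as pd

# Hypothetical 15-minute meter readings (kWh) covering three days.
index = pd.date_range("2021-01-01", periods=288, freq="15min")
meter = pd.Series(1.0, index=index, name="kwh")

meter.iloc[10:14] = np.nan  # a gap on the first day
meter.iloc[150] = 500.0     # an obvious outlier on the second day

# Treat implausible readings as missing (the 100 kWh threshold is an assumption).
cleaned = meter.where(meter < 100.0)

# Fill every gap with the median of the remaining readings.
cleaned = cleaned.fillna(cleaned.median())

# Aggregate the 15-minute readings into daily totals.
daily = cleaned.resample("D").sum()

# Drop everything before the second day.
trimmed = daily.truncate(before="2021-01-02")

print(daily)
print(trimmed)
```

Each step is indeed a single line, and the whole sequence can be rerun as soon as a new batch of meter data arrives.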
Week 4.
The data you are dealing with this week is occupant thermal comfort information collected during building operations from many research projects around the world. The main goal of this week is to learn how to use pandas to understand the statistical information of the data set and visualize the data. Data visualization is done using another Python library named Matplotlib. I found it powerful since it is able to produce a wide variety of graph types, such as histograms, density plots, scatter plots, and many others.
Use Case: Visualizing the modeling and trend data. Needless to say, data visualization is important in energy modeling. It is a great way to communicate our findings to the larger team. I know spreadsheets are good at visualizing data, but it is all about workflow. It does take a little more time to create graphics using Matplotlib in the beginning, but once it’s set up, the next data visualization only takes a second as soon as the results get passed in. Similarly, when dealing with real-world metered data, it is also beneficial to visualize where the gaps and outliers are by simply doing a boxplot or a histogram.
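As an illustration of the "set it up once" idea, here is a minimal sketch of a reusable plotting function built on Matplotlib. The function name plot_demand_overview and the random demand values are hypothetical; any series of readings could be passed in instead.

```python
import matplotlib

matplotlib.use("Agg")  # render without a display; drop this line in a notebook

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_demand_overview(demand):
    """Draw a histogram and a boxplot of the readings side by side."""
    values = demand.to_numpy()
    fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(8, 3))
    ax_hist.hist(values, bins=20)
    ax_hist.set_xlabel("Demand (kW)")
    ax_hist.set_ylabel("Hours")
    ax_box.boxplot(values)
    ax_box.set_ylabel("Demand (kW)")
    fig.tight_layout()
    return fig


# Hypothetical hourly demand (kW) for one week.
rng = np.random.default_rng(0)
demand = pd.Series(rng.normal(50, 10, 168), name="kw")

fig = plot_demand_overview(demand)
# fig.savefig("demand_overview.png")  # uncomment to write the image to disk
```

Once a function like this exists, the next data set only needs one call to produce the same pair of charts.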
Week 5.
Finally, we are here! In Week 5, we are getting into a fancier world. This week’s focus is on Machine Learning, a much-hyped topic in the high-tech industry. Can we borrow it and use it in our building industry? Yes, we can!
ML is complex and deserves its own 5-week (or semester-long) course. Therefore, what we learned this week only scratches the surface (as the instructor readily admits). The course touched briefly on the basics of machine learning and subsequently introduced a Python library called “scikit-learn”. We were able to follow along and build a few basic models in which the code “learns” from date/time, energy consumption, and temperature data. We trained a model on one period, and then it predicted the consumption for the last period, which was pretty close!
Use Case: Predict missing energy data. If you remember, we receive huge data sets from building operations that often have outliers and gaps. ML can be used to predict what the energy consumption would have been in those periods with missing information. Training and testing the model on the rest of the data, then predicting energy consumption for the missing period, can be a more scientific and accurate approach than interpolation.
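Here is a minimal sketch of that gap-filling idea with scikit-learn. Everything in it is an assumption made for illustration: the data is synthetic, temperature is the only input, and LinearRegression is simply the easiest model to show, not necessarily what the course used.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic two weeks of hourly data: energy rises with outdoor temperature.
rng = np.random.default_rng(42)
n = 24 * 14
index = pd.date_range("2021-07-01", periods=n, freq="60min")
hour = index.hour.to_numpy()
temperature = 20 + 8 * np.sin((hour - 9) / 24 * 2 * np.pi) + rng.normal(0, 1, n)
true_energy = 30 + 2.5 * temperature + rng.normal(0, 1, n)

# Simulate a two-day gap in the metered energy data.
metered = true_energy.copy()
metered[100:148] = np.nan

data = pd.DataFrame({"temperature": temperature, "energy": metered}, index=index)

known = data.dropna(subset=["energy"])
missing = data[data["energy"].isna()]

# Train on the hours with readings, then predict the hours without.
model = LinearRegression()
model.fit(known[["temperature"]], known["energy"])
data.loc[missing.index, "energy"] = model.predict(missing[["temperature"]])

# Compare the filled values with what the synthetic meter "would have" read.
gap_error = np.abs(data["energy"].to_numpy()[100:148] - true_energy[100:148])
print(f"Mean absolute error in the gap: {gap_error.mean():.2f} kWh")
```

On real buildings the relationship is rarely this clean, so the error in the gap should always be checked against a held-out period where the true readings are known.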
There you go: your teaser before getting into the course and doing some challenging coding. Hopefully, you will find it useful. I’m off to write some “for” loops now. Feel free to leave comments; we would love to learn about your experience with data analysis and programming applied in our industry.