The following guidelines apply to reports in CS354 - Algorithm Design and Analysis.
Technical reports are often built around data. You’ve collected data in some way (often by running experiments), and you are now presenting it, analyzing it, and drawing conclusions from it. Typically, the data is central to the whole report.
Experiment/Methodology Section¶
I’ll talk about how to present data in a second, but before you present any data, you need to tell the reader where the data came from. The reader should be able to judge whether your data was collected correctly, and they should be able to reproduce your data collection from the description you give. If either of those is not possible given what you write in your report, that is a substantial hole.
So describe your methodology. How did you collect your data? Be specific. A sentence like “I ran the algorithm several times and collected runtimes” raises more questions than it answers: How many times? How did you measure the runtimes? Was anything different between each time the algorithm was run? And so on… Provide all relevant details… enough for the reader to know exactly what you did and be able to reproduce it. You can assume a basic level of knowledge in the reader, of course, so don’t go overboard or be incredibly pedantic.
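As a concrete illustration, here is a minimal sketch of a timing harness whose details a methodology section could then report explicitly (the function name `measure_runtime` and the trial count are my own choices, not anything prescribed by the course):

```python
import random
import statistics
import time

def measure_runtime(algorithm, data, trials=10):
    """Run `algorithm` on `data` several times and record every runtime.

    The details here -- a monotonic high-resolution clock
    (time.perf_counter), the number of trials, and reporting spread
    (stdev) rather than a single number -- are exactly the kinds of
    specifics a methodology section should state.
    """
    runtimes = []
    for _ in range(trials):
        start = time.perf_counter()
        algorithm(data)
        runtimes.append(time.perf_counter() - start)
    return {
        "trials": trials,
        "mean": statistics.mean(runtimes),
        "stdev": statistics.stdev(runtimes),
        "min": min(runtimes),
    }

# Example: time Python's built-in sort on 10,000 random floats.
data = [random.random() for _ in range(10_000)]
result = measure_runtime(lambda d: sorted(d), data, trials=5)
print(result)
```

A report using something like this could then say, for instance, “each runtime is the mean of 5 trials measured with a monotonic clock on otherwise-idle hardware,” which is reproducible in a way that “I ran it several times” is not.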
Results/Analysis: Charts¶
The purpose of a chart is to present a large amount of data in an easily understandable, meaningful way. Always keep that in mind when creating a chart. The choices you make about its type, the included data, and its formatting all have impacts on how understandable and meaningful it is.
First of all, if you don’t have many data points, a chart is less useful. Five separate numbers can often be presented in a table and be perfectly understandable. A chart isn’t always the best way to present data.
If you have a lot of data, though, charts are an excellent way to present it. They can present dozens or even hundreds of data points in a way that gives the reader useful information that a raw data table would obscure.
So all of this points to the fact that you should think carefully about what you want a chart to “say.” It’s presenting some data you’ve collected, sure, but the chart you create from that data will say something about the data, providing a higher level understanding of it by the way it organizes and juxtaposes the data. Concretely, then, you should think about at least two or three different ways you might put the data in a chart (bar chart, line chart, x-y scatter plot… grouped by this or that… highlighting some aspect of the data or another…), and determine which is the most useful for the reader.
Beyond that, there are some specific rules that should typically be followed:
- Charts are meaningless without axis labels. Each axis means something and typically has some units. If it isn’t labeled, the locations of the points within the chart have no meaning.
- Large numbers are often more readable in scientific notation. Instead of labeling an axis with 1000000000, try 10⁹ or 1e9. Most spreadsheet software will have an option for formatting axis labels in this way. Alternatively, change the units. If your measurements are in nanoseconds but many of the values are in the billions, it will be easier to understand if you present those measurements in seconds instead.
- Likewise, if the labels of one axis are powers of 2, then write them as powers of 2; i.e., labels like 2¹⁵, 2¹⁶, etc. instead of 32768, 65536, etc.
- Logarithmic axes are very useful when the data span a very large range and changes among the small values are just as meaningful as changes among the large values. For example, if plotting the net worth of every person in the world, the very few billionaires will require an axis that reaches up toward 100 billion. A linear axis from 0 to 100 billion would result in most people’s net worths being indistinguishable, “stuck” up against the other axis and providing no useful information. A logarithmic axis, on the other hand, would make the difference between 100 and 1000 just as evident as the difference between 100 million and 1 billion. The small values would be visible, providing as much information as the large ones.
- If you have a set of data that are related to a dependent, numeric variable (such as runtimes that depend on the input size), a line chart makes sense. Line charts visually interpolate between values (literally drawing a line between them), suggesting to the reader that other data points exist in those regions, even if you haven’t measured them directly. Put another way, if your data points are samples from what should be a smooth curve, approximating that curve with lines between your data points makes sense.
On the other hand, for data that are related to discrete categories (e.g., popularity of flavors of ice cream), a bar chart is most suitable. Drawing a line or interpolating between discrete categories is nonsensical. What is the value two thirds of the way from “pistachio” to “sea salt caramel fudge truffle?” Don’t suggest, with a line, that there is data there.
- And finally, charts should be self-sufficient and stand on their own. The reader should be able to look at a chart, without reading any of the related text, and know how to interpret it. The text can, and should, explain any important details and should definitely provide analysis of the chart’s data, but it shouldn’t be necessary for the reader to get a basic understanding of the chart.
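Several of the rules above (labeled axes with units, logarithmic scales, power-of-2 tick labels, a line chart for numeric data) can be seen together in one short matplotlib sketch. Matplotlib is just one possible tool here, and the data is invented for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt

# Hypothetical measurements: input sizes that are powers of 2, with
# runtimes spanning several orders of magnitude (roughly n log n shaped).
sizes = [2**k for k in range(10, 21)]
runtimes_s = [n * k * 1e-7 for k, n in zip(range(10, 21), sizes)]

fig, ax = plt.subplots()
# A line chart: the sizes are samples of what should be a smooth curve.
ax.plot(sizes, runtimes_s, marker="o")

# Logarithmic axes keep both the small and the large values readable;
# base=2 on the x-axis yields tick labels that are powers of 2.
ax.set_xscale("log", base=2)
ax.set_yscale("log")

# Charts are meaningless without axis labels -- including units.
ax.set_xlabel("Input size (elements)")
ax.set_ylabel("Runtime (seconds)")
ax.set_title("Runtime vs. input size (hypothetical data)")

fig.savefig("runtime.png")
```

Before settling on this form, it’s worth trying the same data with linear axes or as a bar chart and comparing which version says what you want it to say most clearly.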
Results/Analysis: Analyzing Data¶
So once you have your data collected and presented nicely, you have to analyze it. Keep in mind the overall goal of your report. [Crucial point: If you don’t know what the goal of your report is, you need to figure it out. Ask your instructor if it is not clear.] You’re writing for a particular audience in order to convey some particular information. Your data is part of the message, but your analysis of it is crucial as well.
Say you’re presenting algorithm runtimes (just as a random example). You’ve shown the runtime data you’ve collected. Now what does it mean? Does it match your expectations? What does it say about how the algorithms relate to one another? Are there any surprises? Is there anything unexpected or difficult to explain? Are there any evident trends? Can you make any claims about how the algorithms scale given the data you’ve collected? Answers to any of these questions should be tied to the data you’ve collected, and the reader should be able to see how the data support your claims.
A few additional suggestions about interpreting and analyzing experiment data:
- First, you should start with some idea of what the runtime might be as a function of input size. Do you expect it to be linear, logarithmic, n log n, quadratic,…? So to start, you should do some analysis and/or research on the algorithm you’re running.
- Look at the data and try to determine what scaling relationship it shows you for each algorithm. Is it following a linear trend? Is it logarithmic or quadratic? Spreadsheet software can perform curve fitting that can help with this, but you can also eyeball the data or perform simple analyses like looking for a close-to-constant ratio between size and runtime to identify linear scaling, for example.
- Is the data too noisy? Are there large variations instead of a smooth curve? Then you might want to average over more runs to smooth things out. That can clear things up and make differences more obvious.
- Look for outliers. The curves, especially of average runtimes, should be fairly smooth. If there is a spike (up or down) at any point, it’s probably not representative of the algorithm’s real runtime. See if it persists across multiple experiments. If so, then try to look for an explanation.
- If your data just sort of looks wrong, it might be wrong. That’s often a good time to go back and check whether your code has bugs, guiding your debugging based on how the data appears to be incorrect.
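The “simple analyses” mentioned above can be sketched in a few lines. This example uses invented (size, runtime) pairs; the ratio test checks for linear scaling, and the log-log slope between successive points estimates the exponent k in a t ≈ c·nᵏ relationship:

```python
import math

# Hypothetical (input size, runtime in seconds) measurements.
measurements = [(1_000, 0.012), (2_000, 0.049), (4_000, 0.198), (8_000, 0.801)]

# Ratio test: runtime / size is roughly constant only if scaling is linear.
ratios = [t / n for n, t in measurements]
print("runtime/size ratios:", ratios)  # growing, so not linear

# Log-log slope: if t ≈ c * n^k, then successive points give
#   k ≈ log(t2/t1) / log(n2/n1).
# A slope near 1 suggests linear scaling, near 2 suggests quadratic.
slopes = [
    math.log(t2 / t1) / math.log(n2 / n1)
    for (n1, t1), (n2, t2) in zip(measurements, measurements[1:])
]
estimated_exponent = sum(slopes) / len(slopes)
print(f"estimated exponent: {estimated_exponent:.2f}")
```

For this invented data the estimated exponent comes out close to 2, consistent with quadratic scaling. Spreadsheet curve fitting does essentially the same thing more robustly, but a quick check like this is often enough to see which trend your data follows.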