Thoughts on Large Data Sets

One of the thoughts that came out of my most recent meeting with Simon was that the choice of exam board will be influenced by the large data set. I had previously been of the opinion that I could leave the choice until January 2018, waiting to see whether any more specimen/mock papers became available and analysing question types. However, this would mean spending less time familiarising students with the specific large data set for whichever exam board we choose. As a result, I have downloaded the data sets for AQA, Edexcel and both OCR specifications. I should point out that I am not a statistician: I have taught S1 once and try to avoid it if I can!

I have started to look at the data sets to see which is the most usable, and which one students will be best able to gain insight into and draw on in their exams. We want to revisit the data constantly so that students become really familiar with it. This means that portability is important, as we will not always have access to computer facilities.

AQA – Purchased quantities of household food & drink by Government Office Region and Country

The data is split into 10 regions (under separate tabs), giving the average amounts of various foods and drinks purchased per person per week. There is also a tab with averages for the whole of England. Having spent some time in Excel playing around with the data, I found that each region can be fitted onto a single sheet of A3 paper (11 sheets in total).

Looking at the questions in the specimen paper, students are expected to be able to recall information about the average amounts of certain food groups from different regions. This is something that could only be known by someone who has done extensive work with the data set before, and given the sheer scale of the data is unlikely to be something that you could repeat for all of the different food groups.

Later questions involving the data set give a small excerpt and ask about it. These are much more accessible to students who do not have as much familiarity, but will still be easier for those who are aware of the context. For example, there is a question about the total amount of confectionery purchased, which does not state that it is based on averages.

Total Marks based on Large Data Set in AS Spec Paper: 9 (Out of 80 on paper 2 and 160 across the AS)

OCR A – Method of Travel / Age Structure

The OCR A data set covers methods of travel to work, broken down into regions, taken from the national censuses of 2001 and 2011 (separated into two sheets). There is also data about the ages of the residents of the regions (two further separate sheets). Each tab can be set to print across three A3 pages, so a portable copy needs 12 pages in total.

In the question pictured here it would be advantageous to be familiar with the data set, particularly for part (ii), as there are different codes for the authorities based on their type. If you knew this then you would know how to separate the authorities further and would merely have to explain this.

For the other question based on the data set (not pictured), a summary table has been created. It is less obvious what the benefit of knowing the data is here, although general familiarity and having looked at possible summary statistics will help.

Total Marks based on Large Data Set in AS Spec Paper: 8 (Out of 75 on paper 1 and 150 across the AS)

OCR B (MEI) – Population data and Olympic success

The first thing to note here is that the MEI specification (OCR B) has taken a very different position to the other boards. There will be three different data sets that will be used in rotation. The data sets that will be used for ‘live’ specifications are not available yet.

The data set that is available for the specimen papers is far less ‘large’ than the others, reducing to two A3 sheets. The question included here really grabbed me as being interesting: what were the outliers in Sub-Saharan Africa? On inspection, the data points that stood out came from islands, rather than countries on the continent.

This data set seems much more manageable than the others, and over two years I would expect students to be able to become very familiar with it.

Total Marks based on Large Data Set in AS Spec Paper: 7 (Out of 70 on paper 2 and 140 across the AS)

Edexcel – Weather Data

Edexcel’s weather data comes from 5 weather stations in the UK and 3 abroad, with readings from both 1987 and 2015. I have been able to fit the data for each station, for a single year, on one A3 sheet (16 sheets in total).

The questions based on this data set again seemed not to require much detailed knowledge of the readings. In the question shown here it is only the fact that there is one reading per day that will help with part (b).

Of course, as Edexcel has not been accredited yet, this may change.

Total Marks based on Large Data Set in AS Spec Paper: 11 (Out of 60 on paper 2 and 160 across the AS)
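
To compare the boards side by side, here is a rough Python sketch that turns the figures quoted above into percentages. It is only my own illustration: the names `boards`, `share` and `summary` are made up for this example, and the numbers are simply copied from the specimen-paper totals and A3 sheet counts in each section, so they would change if the live papers differ.

```python
# Figures copied from the sections above (specimen papers only).
# Each entry: (board, marks based on the large data set,
#              marks on that paper, marks across the AS, A3 sheets needed)
boards = [
    ("AQA", 9, 80, 160, 11),
    ("OCR A", 8, 75, 150, 12),
    ("OCR B (MEI)", 7, 70, 140, 2),
    ("Edexcel", 11, 60, 160, 16),
]


def share(part, whole):
    """Return part as a percentage of whole."""
    return 100 * part / whole


# Build a small summary keyed by board name.
summary = {
    name: {
        "paper_pct": share(lds, paper),
        "as_pct": share(lds, total),
        "sheets": sheets,
    }
    for name, lds, paper, total, sheets in boards
}

# Print one line per board, percentages shown to one decimal place.
for name, row in summary.items():
    print(
        f"{name:12} {row['paper_pct']:5.1f}% of paper  "
        f"{row['as_pct']:4.1f}% of AS  {row['sheets']:2d} A3 sheets"
    )
```

On these specimen figures the large data set accounts for somewhere between 5% and 7% of the AS marks for every board, so the marks alone are unlikely to separate them; the number of sheets needed for a portable copy varies far more.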


While the use of the data set will only form part of my decision on which exam board to use, I have found the process of sifting through the data sets, and the questions that relate to them, extremely useful. It has also shown me the benefits of this approach. Even at this early stage, it is noticeable how the data is starting to feel familiar. I think that this will develop much more ownership of the data and make structuring easier. Now that students know they are expected to know the data set, they are more likely to see the value in using it as part of exercises.


