Module 3 Creating a Stata Dataset

Before you have a dataset, you must first collect data on a topic of interest. In this module, we will begin creating a dataset by looking at how to extract data we’ve collected in a questionnaire into a structure or format that can be understood and interpreted by Stata. Please note, this module does not teach how to create a questionnaire. It rather shows how to get the data you’ve collected in a questionnaire into a dataset for analysis in Stata. By the end of the module, students should:

become familiar with basic requirements for creating a Stata dataset
be able to input information from a questionnaire into a grid or table format for use in creating a dataset
know the best practices for naming and labeling variables in a dataset

Since most of us would be using what we learn in these modules to complete some of the tasks in our term projects, we will use a case study of a fictitious term project to illustrate what we would be exploring. And, since we will be creating the dataset for this module from a questionnaire, here is a sample questionnaire I created to help illustrate most of the ideas and concepts to be treated. The questionnaire is based on a topic investigating Memorial University students’ perceptions about recent tuition hikes by the Government of Newfoundland:

Respondent id - A number from 1 to infinity

What is your gender - Male - Female - Other, specify - Prefer not to say

What is your level of study - Graduate - Undergraduate - Diploma

What is your tuition fee status - Canadian/Permanent resident - International student

How would you rate the current tuition hikes - Fair - Unfair - I’m not sure

How much will the new hikes affect your studies - Very much - Not at all - I’m not sure

Which of the following is true about you - I am a self funded student - I have a full or partial scholarship - I prefer not to say

The above questionnaire shows the questions to be posed to respondents on the topic the study seeks to investigate: perceptions of the tuition hike by the Government of Newfoundland among Memorial University students. This questionnaire can be converted to a dataset structured in a grid or spreadsheet/table format. Many statistical packages like Stata require data that are set in a grid, like a table, with rows and columns. Each row contains data for an observation (which is often a subject or a participant in your study), and each column contains the measurement of some variable of interest about the observation, such as age.

For some variables, the meaning of the values in the data is obvious. For instance, if your variable records people’s year of birth, then it is clear what the values mean as soon as you see them. For other variables, the meaning of the values needs to be specified with a value label. Many questionnaires ask whether respondents “Strongly agree,” “Agree,” “Disagree,” or “Strongly disagree,” or ask respondents to answer some question with either a “Yes” or “No.” This sort of data is usually coded as numbers, such as 1 for “Yes” and 2 for “No,” and it can make understanding tables of data much easier if we create value labels to be displayed in the output in addition to (or instead of) the numbers.
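As a minimal sketch of the Yes/No example above, the commands below code four fictitious answers as 1 and 2 and then attach a value label. The variable name agree and the label name yesno are hypothetical, chosen only for illustration.

```stata
* Four made-up answers to a Yes/No item, coded 1 = Yes and 2 = No.
clear
input agree
1
2
2
1
end

* Define the mapping from numeric codes to text once...
label define yesno 1 "Yes" 2 "No"
* ...then attach that mapping to the variable that uses those codes.
label values agree yesno

* The table now shows "Yes" and "No" instead of 1 and 2.
tabulate agree
```

The data themselves are still stored as the numbers 1 and 2; the label only changes how they are displayed.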

3.1 Creating a Coding Schema

Software like Stata requires the data we bring into it to be in a format that it understands and can interpret. This means that we must sometimes convert or translate the data from a format that a particular software package - in this case Stata - does not recognize to one that it does. In this module we will explore how to take data from our sample questionnaire and create a codebook that can be used in creating a dataset that Stata can understand and interpret.

Looking at the sample questionnaire we developed earlier, it is clear that nearly all the answers are text (strings). Hence, in order to get the data we collect in the questionnaire into Stata for our analysis, we must find a way to structure the data into a form that Stata can understand and interpret. Doing this is called coding, where we assign special numeric or string characters to questions and answers to enable analysis to be done in Stata. In this module we will convert the answers in our questionnaire above into numeric codes. This means that the answers to the question about gender, for example, can be assigned numeric codes such as Male = 1, Female = 2, Other = 3, Prefer not to say = 4. In the same way, we can assign a number to each possible answer for each of the items in our questionnaire.
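The coding step just described can be sketched in Stata as follows, assuming the gender answers were first typed in as text. The variable name gender_text and the four sample answers are hypothetical; the numeric codes follow the gender question of the questionnaire.

```stata
* Four made-up text answers to the gender question.
clear
input str20 gender_text
"Male"
"Female"
"Prefer not to say"
"Other"
end

* Start with a missing numeric code, then fill it in answer by answer.
generate byte gender = .
replace gender = 1 if gender_text == "Male"
replace gender = 2 if gender_text == "Female"
replace gender = 3 if gender_text == "Other"
replace gender = 4 if gender_text == "Prefer not to say"

* Show the text answer next to the code it was given.
list gender_text gender
```

Any answer that does not match one of the four strings exactly would keep a missing code, which makes typing mistakes easy to spot.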

We can also assign a short variable name to the variable that will contain the data for each item. As a rule of thumb, a variable name may contain uppercase and lowercase letters, numbers, and the underscore character, but it must not begin with a number, must contain no spaces, and may be at most 32 characters long. Below is a sample codebook I created using information from our questionnaire:

Question	Variable name	Value labels	Numeric codes
Identification number	id	record order	1 to n
What is your gender	gender	Male, Female, Other, Prefer not to say	1, 2, 3, 4
What is your level of study	study_level	Graduate, Undergraduate, Diploma	1, 2, 3
What is your tuition fee status	fee_status	Canadian/Permanent resident, International	1, 2
How would you rate the current tuition hikes	tuition_hike	Very Fair, Fair, I’m not sure, Unfair, Very Unfair	1, 2, 3, 4, 5
How much will the new hikes affect your studies	effects	Very affected, Affected, I’m not sure, Slightly affected, Not affected at all	1, 2, 3, 4, 5
How is your study funded	fee_type	I am self-funded, Have scholarship, I prefer not to say	1, 2, 3

The codebook above helps translate the numeric codes in the dataset back into the questions that were asked of participants and the answer options they could choose from. Without the codebook, it would be difficult to determine, for instance, that every international student among our participants is assigned a 2 in the dataset.
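To illustrate why the codebook is needed, here is a small sketch with five made-up respondents. Nothing in the data says what a 2 means; only the codebook tells us that counting the rows with fee_status equal to 2 counts the international students.

```stata
* Five fictitious respondents; fee_status is stored only as a number.
clear
input id fee_status
1 1
2 2
3 1
4 2
5 2
end

* Per the codebook, fee_status == 2 means International student.
count if fee_status == 2
display "International students: " r(N)
```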

3.2 Translating Our Dataset with a Coding Sheet

The codebook from the previous section defines how the questions in the questionnaire and the answers given by respondents should be represented in a computer-readable format. From it, we prepare a coding sheet that takes the data from its current state in the questionnaire and translates it into that format for analysis in Stata. In this sense, the coding sheet acts as our translator, converting the data in the questionnaire to a format that Stata recognizes and understands.

The coding sheet may be created using a spreadsheet application like Microsoft Excel, or even a text editor such as Notepad++ that can save comma-delimited files. However, Stata has a special application for data entry known as the Data Editor (see figure 1.2). It provides an interface that is similar to a spreadsheet, but it has some features that are more suited to creating datasets specifically for use in Stata. Figure 1.3 below shows the interface for Stata’s Data Editor:
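If the coding sheet is prepared outside Stata as a comma-delimited file, it can be read in with Stata’s import delimited command. The sketch below is only an illustration: the file name coding_sheet.csv is hypothetical, and the file is first written out from a one-row example so that the commands can be run as they stand.

```stata
* Build a one-row example and save it as a comma-delimited file,
* standing in for a sheet saved from a spreadsheet or text editor.
clear
input id gender study_level fee_status tuition_hike effects fee_type
1 1 1 1 3 2 2
end
export delimited using "coding_sheet.csv", replace

* Read the file into Stata; the first line holds the variable names.
import delimited using "coding_sheet.csv", varnames(1) clear

* Check which variables were created and how they are stored.
describe
```

For an Excel workbook, Stata’s import excel command plays the same role.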

Figure 1.3. Data Editor

In this module, we will be using Stata’s Data Editor to create a coding sheet from the questionnaire above. The Data Editor may be opened in both Edit and Browse modes. Opening in Edit mode allows the editing of data sets. On the other hand, one can simply browse or preview the data set in the Browse mode. To open the Data Editor in Edit mode, click on the icon on the toolbar that looks like a table or spreadsheet with a pencil pointing downward towards the far left corner (see figure 1.4 below).

Figure 1.4. Data Editor icon
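The two modes can also be opened by typing commands in Stata’s Command window instead of clicking the toolbar icon. A minimal sketch:

```stata
* Open the Data Editor in Edit mode, where values can be changed.
edit

* Open the Data Editor in Browse mode, where values can only be viewed.
browse
```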

We can start editing as soon as the Data Editor comes up. Just like in other spreadsheets, data may be entered directly into the various columns. Let’s enter data for a fictitious first respondent who answered the first question as Male, the second as graduate, the third as Canadian, the fourth as “I am not sure,” the fifth as “not at all” and the sixth as “I’m on scholarship.” To do this:

Click on the first cell at the top left corner and enter 1 as the value for our respondent’s id.
Press the Tab key to move one cell to the right, and enter the value for the respondent’s gender, which again is a 1.
Press the Tab key again and enter 1 as the value for level of study, which he answered as Graduate.
Press the Tab key again to move to the next cell on the right and enter 1 again as the value for fee status, which he answered as Canadian.
Press Tab once more and in the next cell enter 3 as the value for tuition hike.
Tab again to the next cell and enter 2 as the value for the effects variable.
Finally, press Tab one last time and enter 2 as the value for fee_type, which describes him as being on a full or partial scholarship.

The results of the data entry procedure we just completed may be viewed below.

Figure 1.5.
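The same row can also be typed as commands rather than into the Data Editor. The sketch below uses Stata’s input command with the variable names from the codebook (study_level written in lowercase) and the values given for the first respondent above. It is an alternative to the point-and-click steps, not the method this module follows.

```stata
* Start with an empty dataset.
clear

* Name the variables, then type one line of values per respondent.
input id gender study_level fee_status tuition_hike effects fee_type
1 1 1 1 3 2 2
end

* Display the row that was just entered.
list
```

Each further respondent would be one more line of numbers before end.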

After you have entered all the values for a specific respondent, press the Enter key, which will move the cursor to the next row of the Data Editor (Edit), and then press the Home key, which will move the cursor to the first column of that row. You can then proceed to enter the rest of the data. The Tab key is also smart: once the first row has been entered, pressing Tab after the last value in a row will move the cursor to the first column of the next row.

However, before doing this it would be good to enter more informative variable names to replace the default names of var1, var2, etc. To change a variable’s name:

Click anywhere in the column for that variable, for example var1, or click on its name in the Variables pane at the top right of the Data Editor.
Then, in the Properties pane at the lower right of the Data Editor, change Name from var1 to a specific variable name, in our case, id.

Remember, the name can be up to 32 characters, but it is standard practice to use short names. Since Stata is case sensitive, it is also best to use lowercase names throughout, so that you never have to remember which case a particular name uses. From the Properties pane, we can also change the variable’s label to be “Participant identification” by typing this text in the Label box. After completing these routines for 20 fictitious respondents, our Data Editor (Edit) should look like figure 1.6 below:

Figure 1.6. Data Editor showing manually typed data
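The renaming and labeling done through the Properties pane can also be written as commands. The sketch below assumes a dataset that still has the default names var1 to var7 and uses the variable names from the codebook, with study_level in lowercase. The single row of values is the fictitious first respondent.

```stata
* One row entered under the Data Editor's default variable names.
clear
input var1 var2 var3 var4 var5 var6 var7
1 1 1 1 3 2 2
end

* Replace each default name with the name from the codebook.
rename var1 id
rename var2 gender
rename var3 study_level
rename var4 fee_status
rename var5 tuition_hike
rename var6 effects
rename var7 fee_type

* Attach a descriptive label to the id variable.
label variable id "Participant identification"

* Confirm the new names and the label.
describe
```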

The above image shows the Data Editor (Edit) with manually entered variables and values from our questionnaire. In its current state, however, the dataset may confuse some users, because the numeric codes do not say what they stand for. To address this problem, we need to create descriptive labels for the values of most of the variables. We do this in the next module.