Many UX designers are somewhat skeptical about handling data, believing it requires in-depth statistical and mathematical knowledge. That is true of advanced data science, but not of the fundamental research data analysis that most UX designers need. We live in an extensively data-driven world, and basic knowledge of data is useful for any professional, UX designers included.
This article deals with basic data concepts for structured data: meaningful data that you can present in a table, with columns and rows. Unstructured data is an entire subject by itself and is more challenging to analyze. For structured data that you can represent as a table, the basic concepts are as follows:
A dataset is the whole set of data that you want to analyze, for example, an Excel table. Another convenient and widely used format for storing datasets is the comma-separated values (CSV) file. CSV files are simple text files that hold information in the form of a table. Every CSV row corresponds to a row in that table, and the values in each CSV row are separated by commas, with each value corresponding to a table cell.
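To make the row-and-cell mapping concrete, here is a minimal sketch in R. The three lines of CSV text and their values are made up for illustration; they are not taken from any real dataset.

```r
# Three lines of CSV text: a header row followed by two data rows.
# The values are hypothetical and only illustrate the format.
csv_text <- "Make,Model,Year,Price
BMW,3,2015,18500
Audi,A4,2016,21000"

# read.csv can parse text directly through its `text` argument.
sample_table <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Each CSV row becomes a table row; each comma-separated value becomes a cell.
print(sample_table)
```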
A data point is a single row in a dataset table. A dataset is therefore a collection of data points.
An individual value within a data-point row, i.e., a table cell, represents a data variable. There are two main kinds of data variables:
- Qualitative variables.
- Quantitative variables.
Qualitative variables, also called categorical variables, take values from a distinct, limited set, e.g., color = red/blue/green. Quantitative variables hold numerical values, e.g., height = 156. A quantitative variable can take any numerical value, which is not the case with a qualitative variable.
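The difference can be sketched in R, where categorical values are commonly stored as factors and quantitative values as numeric vectors. The variable names and values below are hypothetical.

```r
# A qualitative (categorical) variable: values come from a fixed set of levels.
color <- factor(c("red", "blue", "green", "red"))

# A quantitative variable: any numerical value is possible.
height <- c(156, 171.5, 163, 180)

levels(color)      # the distinct categories
table(color)       # counting is the natural summary for categories
mean(height)       # arithmetic is meaningful for quantitative values
is.numeric(color)  # a factor is not numeric
```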
Creating Your Data Project-
Now that you know the basics, you can get your hands dirty and create your first data project. The scope of the project is to analyze a dataset by going through the whole data flow: importing, processing, and plotting the data. You will first need to choose your dataset, and then download and install the tools you will use to analyze it.
This example uses a used car dataset; the focus is on the data flow and the tools rather than on the data itself. You can download a used car dataset from Kaggle, one of the most prominent sources of free datasets.
First, you will have to register. After you have downloaded the file, open it and have a look at it. It is a huge CSV file, but make sure you understand its gist.
Each data point has multiple variables, separated by commas. With the dataset in place, you can move on to choosing and working with your tools.
Tools Of The Trade-
In this example, you will be using the R language and RStudio to analyze the dataset. R is a popular and easy-to-learn language. It is used not only by data scientists but also by people working in financial markets, medicine, and other areas. R projects are typically developed in RStudio, a development environment for R. RStudio has a free version, which is more than enough for your needs as a UX designer.
Many UX designers prefer to use Excel in their data workflow. However, R is easy to learn and more flexible and powerful than Excel, so it is a worthwhile addition to your tool kit.
Installing The Tools-
First, download and install R and RStudio. You have to install R first, followed by RStudio. The installation process for both tools is straightforward.
Once the installation is complete, create a project folder named used-cars-prj. Inside it, create a subfolder called data and copy the dataset file you downloaded from Kaggle into that subfolder, renaming it to used-cars.csv. Then return to the project folder, used-cars-prj, and create a plain text file named used-cars.r.
Now your folder structure is in place, and you can open RStudio to create a new R project. Click on the New Project option in the File menu and select the second option: Existing Directory. Then select the project directory and click on the "Create Project" button. Once you have created the project, open used-cars.r in RStudio. You will be adding all your R code to this file.
Your first line in used-cars.r will read data from the "used-cars.csv" file. Remember that CSV files are just plain text files that store data.
Your first line of R code will look as follows:
cars <- read.csv("./data/used-cars.csv", stringsAsFactors = FALSE, sep=",")
The read.csv call above takes three arguments:
- The first is the path of the file to read, which is located in the data folder.
- stringsAsFactors = FALSE makes sure strings like "BMW" or "Audi" stay plain character values and are not converted to factors (since R 4.0.0, this is the default).
- sep="," specifies the separator between values in the CSV file: a comma.
After the CSV file is read, the data is stored in the cars data frame object. A data frame is a two-dimensional data structure, similar to a table, and is fundamental to manipulating data in R. After you have added the line and run it, the cars data frame will be created for you. In the top-right quadrant of RStudio, you will find the cars data frame in the Data section of the Environment tab. When you double-click on cars, a new tab will open in the top-left quadrant of RStudio and present the cars data frame.
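You can also inspect a data frame from code instead of clicking through the RStudio panes. The sketch below uses a small stand-in data frame with made-up values so that it runs by itself; in the project, you would run the last three lines on the cars data frame loaded from used-cars.csv. The column names are assumed from the columns used later in this article.

```r
# Stand-in for the real data (hypothetical values).
# In the project, skip this and use the cars data frame read from the CSV file.
cars <- data.frame(
  Make    = c("BMW", "BMW", "Audi", "Ford"),
  Model   = c("3", "5", "A4", "Focus"),
  Price   = c(18500, 27800, 21000, 9500),
  Year    = c(2015, 2016, 2016, 2011),
  Mileage = c(47000, 39000, 52000, 118000),
  stringsAsFactors = FALSE
)

dim(cars)      # number of rows (data points) and columns (variables)
str(cars)      # name and type of every column
head(cars, 3)  # the first three rows
```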
Processing refers to removing, transforming, or adding information to the dataset to prepare it for the type of analysis you wish to perform. You have your data in a data frame object, so now you will install the dplyr library, a robust library that helps manipulate data. To install the library in your R environment, you need to write the following line at the top of your R file.
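Packages from CRAN are installed with install.packages. A minimal sketch, assuming the machine has internet access; the check around the call and the explicit repos mirror are additions for convenience, not part of the original walkthrough:

```r
# Install dplyr from CRAN only if it is not already available.
# Assumption: internet access; the repos URL is the CRAN cloud mirror.
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr", repos = "https://cloud.r-project.org")
}
```

Installation is needed only once per R environment, so you can remove or comment out the line after the first run.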
To add the library to your current project, you will write the next line:
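Loading an installed package is done with library. A minimal sketch, assuming dplyr is already installed; the two-line check after it uses a made-up data frame only to confirm that the package is loaded:

```r
# Attach dplyr so that filter(), select(), and the %>% pipe are available.
library(dplyr)

# Quick check with hypothetical data: keep only the rows where x is above 1.
check <- data.frame(x = 1:3) %>% filter(x > 1)
print(check)
```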
Once you have added the dplyr library to your project, you can start processing the data. You currently have a huge dataset, but you need only the rows for a single car maker and model, so that the prices are comparable.
You will use the following R code to keep only the data for the BMW 3 Series and eliminate the rest. You can pick any other manufacturer and model from the dataset and expect to see the same kind of patterns.
cars <- cars %>% filter(Make == "BMW", Model == "3")
You now have a more manageable dataset. The purpose is to analyze the distributions of the cars' price, age, and mileage, along with the correlations between them. For that, you need to keep only the "Price", "Year", and "Mileage" columns and remove the rest. You can do it with the following line:
cars <- cars %>% select(Price, Year, Mileage)
Your data now has the right shape, so you can move on to making some plots. Remember that you will be focusing on two aspects: the distribution of individual variables and the correlations among them. A variable's distribution helps you understand what an average or a high price for a used car is, or what proportion of the cars cost more than a specific amount. The same applies to the age and mileage of the cars. Correlations, on the other hand, help you understand how variables, e.g., age and mileage, are related to each other.
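The questions a distribution answers can also be put into numbers. The following is a sketch with a stand-in data frame of made-up values; the 20000 threshold is an arbitrary example, not a figure from the dataset.

```r
# Stand-in for the processed data (hypothetical values).
# In the project, use the cars data frame produced by the steps above.
cars <- data.frame(
  Price   = c(9500, 14200, 18900, 23500, 27800, 32400),
  Year    = c(2008, 2010, 2012, 2014, 2015, 2017),
  Mileage = c(142000, 118000, 90500, 61000, 47000, 21500)
)

mean(cars$Price)      # the average price
median(cars$Price)    # the middle price, less sensitive to extreme values

# Proportion of cars priced above an example threshold.
share_above <- mean(cars$Price > 20000)
share_above
```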
You will be using two kinds of data visualization:
- histograms for variable distribution.
- scatter plots for correlations.
You can plot the car price histogram in the R language like this:
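Base R draws histograms with the hist function. A minimal sketch, assuming the processed data frame has a Price column as selected earlier; the stand-in values are made up so that the snippet runs by itself, and the title and axis label are optional additions:

```r
# Stand-in for the processed data (hypothetical values).
# In the project, use the cars data frame produced by the steps above.
cars <- data.frame(
  Price   = c(9500, 14200, 18900, 23500, 27800, 32400),
  Year    = c(2008, 2010, 2012, 2014, 2015, 2017),
  Mileage = c(142000, 118000, 90500, 61000, 47000, 21500)
)

# Histogram of the price distribution.
hist(cars$Price, main = "Price distribution", xlab = "Price")
```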
With RStudio, you can run your code line by line; in this case, you need to run only the line above to display the histogram. You do not need to rerun the entire script, since you have already run the earlier lines once.
Just as for the cars' prices, you will use a similar line to plot the age histogram of the cars.
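The dataset stores the model year rather than the age, so one possible sketch derives the age first. Computing age as the current year minus Year is an assumption made here for illustration; plotting cars$Year directly would show the same distribution mirrored. The stand-in values are made up.

```r
# Stand-in for the processed data (hypothetical values).
cars <- data.frame(
  Price   = c(9500, 14200, 18900, 23500, 27800, 32400),
  Year    = c(2008, 2010, 2012, 2014, 2015, 2017),
  Mileage = c(142000, 118000, 90500, 61000, 47000, 21500)
)

# Assumption: age in years is the current year minus the model year.
age <- as.integer(format(Sys.Date(), "%Y")) - cars$Year

# Histogram of the age distribution.
hist(age, main = "Age distribution", xlab = "Age (years)")
```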
The R code and the histogram for mileage are as follows:
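A minimal sketch of the mileage histogram, again using hist and assuming a Mileage column; the stand-in values are made up so that the snippet runs by itself:

```r
# Stand-in for the processed data (hypothetical values).
cars <- data.frame(
  Price   = c(9500, 14200, 18900, 23500, 27800, 32400),
  Year    = c(2008, 2010, 2012, 2014, 2015, 2017),
  Mileage = c(142000, 118000, 90500, 61000, 47000, 21500)
)

# Histogram of the mileage distribution.
hist(cars$Mileage, main = "Mileage distribution", xlab = "Mileage")
```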
For correlations, take a closer look first at the age–price correlation of the cars. You can expect the price to be negatively correlated with the age: as the age of a car increases, its price decreases. You will use the R plot function to display the price–age correlation, which will look like this:
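A sketch of the scatter plot with the base R plot function. The age is derived from Year as the current year minus the model year, which is an assumption made here; the cor call is an optional addition that puts a number on the relationship. The stand-in values are made up.

```r
# Stand-in for the processed data (hypothetical values).
cars <- data.frame(
  Price   = c(9500, 14200, 18900, 23500, 27800, 32400),
  Year    = c(2008, 2010, 2012, 2014, 2015, 2017),
  Mileage = c(142000, 118000, 90500, 61000, 47000, 21500)
)

# Assumption: age in years is the current year minus the model year.
age <- as.integer(format(Sys.Date(), "%Y")) - cars$Year

# Scatter plot: age on the x-axis, price on the y-axis.
plot(age, cars$Price, xlab = "Age (years)", ylab = "Price")

# Correlation coefficient; a negative value means price falls as age rises.
age_price_cor <- cor(age, cars$Price)
age_price_cor
```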
As for the mileage–age correlation, you can expect mileage to increase with age, implying a positive correlation. The code is as follows:
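A sketch along the same lines, with age derived from Year under the same assumption as before and made-up stand-in values:

```r
# Stand-in for the processed data (hypothetical values).
cars <- data.frame(
  Price   = c(9500, 14200, 18900, 23500, 27800, 32400),
  Year    = c(2008, 2010, 2012, 2014, 2015, 2017),
  Mileage = c(142000, 118000, 90500, 61000, 47000, 21500)
)

# Assumption: age in years is the current year minus the model year.
age <- as.integer(format(Sys.Date(), "%Y")) - cars$Year

# Scatter plot: age on the x-axis, mileage on the y-axis.
plot(age, cars$Mileage, xlab = "Age (years)", ylab = "Mileage")

# Correlation coefficient; a positive value means mileage rises with age.
age_mileage_cor <- cor(age, cars$Mileage)
age_mileage_cor
```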
You will also see a negative correlation between the mileage and the price of the cars: the higher the mileage, the lower the price.
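The mileage–price relationship can be plotted the same way. A sketch with made-up stand-in values; the cor call is an optional addition:

```r
# Stand-in for the processed data (hypothetical values).
cars <- data.frame(
  Price   = c(9500, 14200, 18900, 23500, 27800, 32400),
  Year    = c(2008, 2010, 2012, 2014, 2015, 2017),
  Mileage = c(142000, 118000, 90500, 61000, 47000, 21500)
)

# Scatter plot: mileage on the x-axis, price on the y-axis.
plot(cars$Mileage, cars$Price, xlab = "Mileage", ylab = "Price")

# Correlation coefficient; a negative value means price falls as mileage rises.
mileage_price_cor <- cor(cars$Mileage, cars$Price)
mileage_price_cor
```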
From Numbers To Data Visualization-
With this example, you have implemented two types of visualization:
- histograms for data distributions.
- scatter plots for data correlations.
Now that you have worked through the entire data flow of importing, processing, and plotting data, the concepts should be much clearer. You can apply the same data flow to any new dataset that you come across.