Introduction to Data
Overview
Data Migration
- A practical way to conceptualize what data is (this article)
- Security considerations when migrating data (article 3 in this series)
- How things can break during a data migration (article 3 in this series)
Defining Data
Low Level
- Binary 1s and 0s
- A drive’s magnetic polarity
- Bits & bytes
Interactable Objects
- Text documents & photos
- Databases
- Your email inbox
Conceptual
- Predicting future revenue
- A detailed presentation
- Encoding & Encryption
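To make the three levels above a little more concrete, here is a minimal Python sketch that looks at one small piece of data as text a person can read, as encoded bytes, and as raw bits. The sample string and the choice of UTF-8 are assumptions made purely for illustration.

```python
# Illustrative sketch: one piece of data viewed at different levels.
text = "Hi"  # interactable object: text a person can read

# Encoding: turn the text into bytes (UTF-8 is an assumption for this example).
raw = text.encode("utf-8")

# Low level: show each byte as eight binary digits.
bits = " ".join(f"{byte:08b}" for byte in raw)

print(list(raw))            # [72, 105]
print(bits)                 # 01001000 01101001
print(raw.decode("utf-8"))  # Hi
```

Nothing about the information changes between the three prints; only the representation does.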
Classifying Data
While the term data is nebulous, it is generally safe to mentally substitute it with the word ‘information’. When most of us talk about data, we mean a collection of related information (e.g. customer data, a data breach, data recovery, research data).
Being a networking guy, I made a chart inspired by the OSI model:
Characteristics of Data
Qualitative Data
- Also called Categorical Data
- Not objective
- Not explained with math
Examples:
- Filetypes
- Documents, Photos, Emails
- Recorded subjective data
- Opinions, y/n polls, colours
Quantitative Data
- Also called Numerical Data
- Measurable & statistical
- Defined with numbers
Examples:
- Snapshots of data
- Current salary, height, age
- Continuous data
- Career salary, height & age over time
There is a schism in the data science field on whether you can make objective data out of subjective data using math to transform it. This is more philosophical than technical so we can move on with the understanding that data is either non-numerical or numerical in nature.
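As a rough sketch of that split, the Python below sorts the fields of a few records into qualitative and quantitative, then summarizes each kind differently: counts for categories, a mean for numbers. The records, the field names, and the rule "ints and floats are numerical, everything else is categorical" are all assumptions for illustration, not a formal definition.

```python
from collections import Counter
from statistics import mean

# Hypothetical survey records; every field name and value is made up.
responses = [
    {"favourite_colour": "blue", "likes_job": "y", "age": 30, "salary": 52000},
    {"favourite_colour": "green", "likes_job": "n", "age": 40, "salary": 61000},
    {"favourite_colour": "blue", "likes_job": "y", "age": 35, "salary": 48000},
]

def is_quantitative(value):
    """Treat ints and floats as numerical; everything else as categorical."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)

def summarize(records):
    """Label each field and summarize it in a way that suits its kind."""
    summary = {}
    for field_name in records[0]:
        values = [record[field_name] for record in records]
        if all(is_quantitative(v) for v in values):
            # Numbers can be measured and averaged.
            summary[field_name] = ("quantitative", mean(values))
        else:
            # Categories can only be counted.
            summary[field_name] = ("qualitative", dict(Counter(values)))
    return summary

for name, (kind, detail) in summarize(responses).items():
    print(f"{name}: {kind} -> {detail}")
```

Averaging a favourite colour makes no sense, which is the practical reason the distinction matters when you inspect a dataset.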
Relationships of Data
Sometimes data is relational: one data object might exist within, or affect, another data object. Non-relational data, as you might guess, consists of data objects that do not directly correlate with or influence each other. This is a bit more precise than our definition of data itself, but be careful not to mistake a dataset for relational just because the data in it appears to be related.
Example A
I tend to think of data relationships and structures like a network router or firewall configuration.
In most routers I’ve worked with, to configure a DHCP server on a VLAN, you somehow get into a global configuration mode and then navigate through nested areas (e.g. system -> interfaces -> DHCP -> servers -> pool) until you can configure a DHCP scope.
The path I take to configure a static IP might be very different (ip -> interfaces -> switchport -> mode -> static).
Even though we think of the DHCP server as inside the VLAN, it lives in a completely different area of the configuration. Now I have to build a route for that VLAN, work with an ACL, potentially assign it to a VRF, and so on.
That would be an example of non-relational data because I’m configuring my data in multiple places, and that information is not updated elsewhere.
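A small Python sketch of the same idea: the dictionary below stands in for a heavily simplified router configuration. The section names, keys, and addresses are invented for illustration and do not match any real vendor's syntax.

```python
# Hypothetical, simplified router config; every name and address is invented.
config = {
    "interfaces": {"vlan20": {"ip": "10.0.20.1/24"}},
    "dhcp": {"pool_vlan20": {"network": "10.0.20.0/24", "gateway": "10.0.20.1"}},
    "acl": {"allow_vlan20": {"source": "10.0.20.0/24"}},
}

# Renumber the VLAN interface only.
config["interfaces"]["vlan20"]["ip"] = "10.0.30.1/24"

# Nothing ties the sections together, so the other copies are now stale.
print(config["interfaces"]["vlan20"]["ip"])      # 10.0.30.1/24
print(config["dhcp"]["pool_vlan20"]["gateway"])  # 10.0.20.1
print(config["acl"]["allow_vlan20"]["source"])   # 10.0.20.0/24
```

The three sections look related to a human, but the data itself holds no link between them, so every copy has to be updated by hand.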
Example B
You have three spreadsheets:
- Spreadsheet A, which has a list of DIY house projects you have done and plan to do and a unique identifier for each project
- Spreadsheet B, which has a list of friends that have helped and/or plan to help and a unique identifier for each friend
- Spreadsheet C, which has a list of unique identifiers for completed/uncompleted projects and the unique identifiers of friends who helped/planned to help with each project
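This is relational data: Spreadsheet C links projects to friends through their unique identifiers, so the details of a project or a friend only need to be recorded in one place. You could add complexity by introducing a project status (completed, on hold, planned). We will discuss databases more in-depth in a later article; this information is just to give additional context to the example.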
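The three spreadsheets map naturally onto three tables. Below is a minimal sketch using Python's built-in sqlite3 module with an in-memory database; the table names, column names, and sample rows are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Spreadsheet A: projects, each with a unique identifier
    CREATE TABLE projects (project_id INTEGER PRIMARY KEY, name TEXT, status TEXT);
    -- Spreadsheet B: friends, each with a unique identifier
    CREATE TABLE friends (friend_id INTEGER PRIMARY KEY, name TEXT);
    -- Spreadsheet C: which friend is attached to which project
    CREATE TABLE project_friends (
        project_id INTEGER REFERENCES projects(project_id),
        friend_id  INTEGER REFERENCES friends(friend_id)
    );
    INSERT INTO projects VALUES (1, 'Build deck', 'completed'), (2, 'Paint fence', 'planned');
    INSERT INTO friends VALUES (1, 'Sam'), (2, 'Alex');
    INSERT INTO project_friends VALUES (1, 1), (1, 2), (2, 2);
""")

QUERY = """
    SELECT p.name, p.status, f.name
    FROM project_friends AS pf
    JOIN projects AS p ON p.project_id = pf.project_id
    JOIN friends  AS f ON f.friend_id  = pf.friend_id
    ORDER BY p.project_id, f.friend_id
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)

# Rename a friend once; every project that references them reflects the change.
conn.execute("UPDATE friends SET name = 'Alexis' WHERE friend_id = 2")
renamed = conn.execute(QUERY).fetchall()
```

Because Spreadsheet C stores only identifiers, the rename touches a single row, yet both projects that reference that friend pick it up.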
Storing Data
Without getting too into the weeds, we will focus on two dominant methods:
- Object Storage (S3, Azure Blob, etc.)
- Block Storage (SAN volumes, local disks formatted with NTFS, etc.)
Object Storage
All encoded unstructured data is stored as a data object. Every file is a data object, even if you upload a whole folder into a bucket: the folder itself would not exist as a container, and each file it held would appear in the bucket as its own object, typically with the folder path kept as part of the object's name.
To keep things scalable, data objects are assigned tags and metadata so they can be found and accessed easily. This is typically cloud-native, but it does not have to be. In the next article, we will host an S3 bucket locally.
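To illustrate the idea, here is a toy in-memory bucket in Python. It is only a sketch of the concept, not the S3 or Azure Blob API: the `Bucket` and `DataObject` classes, their methods, and the sample keys and tags are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class DataObject:
    """One stored object: raw bytes plus descriptive metadata and tags."""
    body: bytes
    metadata: dict = field(default_factory=dict)
    tags: dict = field(default_factory=dict)

class Bucket:
    """Toy bucket: a flat mapping from key to object, with no real folders."""

    def __init__(self):
        self._objects = {}

    def put(self, key, body, metadata=None, tags=None):
        self._objects[key] = DataObject(body, metadata or {}, tags or {})

    def get(self, key):
        return self._objects[key]

    def list_keys(self, prefix=""):
        # A "folder" is just a shared prefix on the key.
        return sorted(k for k in self._objects if k.startswith(prefix))

    def find_by_tag(self, name, value):
        return sorted(k for k, obj in self._objects.items() if obj.tags.get(name) == value)

bucket = Bucket()
bucket.put("photos/2023/deck.jpg", b"fake image bytes",
           metadata={"content-type": "image/jpeg"}, tags={"project": "deck"})
bucket.put("photos/2023/fence.jpg", b"fake image bytes",
           metadata={"content-type": "image/jpeg"}, tags={"project": "fence"})
bucket.put("notes.txt", b"buy screws", tags={"project": "deck"})

print(bucket.list_keys("photos/"))            # ['photos/2023/deck.jpg', 'photos/2023/fence.jpg']
print(bucket.find_by_tag("project", "deck"))  # ['notes.txt', 'photos/2023/deck.jpg']
```

Because the namespace is flat, listing a "folder" is a prefix match on keys, and tags allow lookups that ignore the key layout entirely.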