Introduction to Data

Disclaimer: I am NOT a data scientist, numerical methods/statistics guy, Machine Learning Engineer, or anything remotely related to an expert on data. I am going outside my comfort zone here and sharing my knowledge about data, which could be 100% wrong. If something needs to be corrected here, please let me know.  

Overview

As users’ or businesses’ needs change, the underlying infrastructure needs to be agile enough to meet them. At some point, all hardware and software need to be updated or upgraded. For server operating systems, databases, workstations, applications, etc., this can mean moving them to a new host or moving them into the cloud.
This is the first article in our migrating data series and focuses on the fundamentals of data migration. We are sticking to data concepts and theory here, so, uncharacteristically, there is no hands-on lab. Even though this is as condensed a crash course as possible, there is a lot we need to cover. The next article in this series will get into the meat and potatoes and will primarily comprise the labs for everything we go over here. I encourage you to use both articles together to gain a complete understanding.

Data Migration

Managing how to get data from point A to point B is a career in and of itself, although you have to be superhuman to work in that space. Fortunately, most of us (allegedly) non-reptilians do not have to get too far in the weeds. Unfortunately, a lot of vague terminology is used interchangeably, which makes things confusing for a layman.  
What we need to know before migrating data:  
  1. A practical way to conceptualize what data is (this article)
  2. Security considerations when migrating data (article 3 in this series)
  3. How things can break during a data migration (article 3 in this series)


Defining Data

When someone talks about data, I have to rely on context clues to understand what they are trying to say. It isn’t very clear if you do not know what type of data they are talking about.
 
For example, is data:
Low Level
  • Binary 1s and 0s
  • A drive’s magnetic polarity
  • Bits & bytes
Interactable Objects
  • Text documents & photos
  • Databases
  • Your email inbox
Conceptual
  • Predicting future revenue 
  • A detailed presentation
  • Encoding & Encryption

Classifying Data

There is no great way to define data that I have found, but I picture it as Russian nesting dolls, where the smallest doll might be the electrical pulses of a circuit, and every doll after that adds a layer of complexity. 
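To make the nesting dolls a little more concrete, here is a minimal Python sketch that peels one tiny piece of information apart into its lower layers. The word being encoded and the variable names are my own picks for illustration, and it stops at bits (the electrical pulses are left to your imagination).

```python
# Peel back the "nesting dolls" for one small piece of information.
text = "data"                      # the doll a person actually reads

raw_bytes = text.encode("utf-8")   # the same text encoded as bytes
byte_values = list(raw_bytes)      # each byte as a number from 0 to 255
bits = " ".join(f"{b:08b}" for b in raw_bytes)  # each byte as eight 1s and 0s

print(text)         # data
print(byte_values)  # [100, 97, 116, 97]
print(bits)         # 01100100 01100001 01110100 01100001
```

Every layer describes the same four letters; the only thing that changes is how close to the hardware you are looking.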

While the term data is pretty nebulous, it is pretty safe to mentally substitute it with the word ‘information’. When most of us talk about data, we are talking about a concept representing a collection of related information (e.g., customer data, data breach, data recovery, research data).

Being a networking guy, I made a chart inspired by the OSI model:

Characteristics of Data

We can use some industry frameworks to understand different types of data. You can think of all data as quantitative (measured and analyzed with numbers) or qualitative (contextual, potentially subjective). This is very simplistic, and things are way more complicated, but qualitative/quantitative is all that most of us need to know.
Qualitative Data
  • Also called Categorical Data
  • Not objective
  • Not explained with math
  Examples:
  • Filetypes
    • Documents, Photos, Emails
  • Recorded subjective data
    • Opinions, y/n polls, colours
Quantitative Data
  • Also called Numerical Data
  • Measurable & statistical
  • Defined with numbers
  Examples:
  • Snapshots of data
    • Current salary, height, age
  • Continuous data
    • Career salary, height & age over time
          You are in good company if you immediately start thinking of how you could tally up categorical data and present the sum as numerical data. For years, I thought any numerical representation of categorical data would count as numerical data because it is quantitative. This is another nebulous area because, technically, the data you are describing is still categorical; the counts are numbers, but what they summarize is still qualitative data.
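Here is a rough Python sketch of that tallying idea. The poll answers and heights are made up for illustration. Counting the colours gives you numbers, but the answers themselves are still categories; there is no sensible "average colour" the way there is an average height.

```python
from collections import Counter
from statistics import mean

favourite_colours = ["blue", "green", "blue", "red", "blue"]  # made-up poll answers
heights_cm = [170, 182, 165, 178]                             # made-up measurements

# Tallying categorical data produces counts per category.
colour_tally = Counter(favourite_colours)
print(colour_tally.most_common(1))   # [('blue', 3)]

# Numerical data supports arithmetic directly.
print(mean(heights_cm))              # 173.75
```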

           

          There is a schism in the data science field on whether you can make objective data out of subjective data using math to transform it. This is more philosophical than technical so we can move on with the understanding that data is either non-numerical or numerical in nature.  

          Relationships of Data

          If you are working with any developers or database folks, they will use many terms that don’t mean anything. (Disclaimer: network engineers have an eternal war with devs and DBAs, similar to the conflict between angels and demons.) Your goal is to be conversational enough to understand the end goal without getting trapped in wonderland.

          Sometimes, data is relational, meaning one data object might also exist within or affect another data object. You might be surprised to learn that non-relational data is made up of data objects that do not directly correlate with or influence each other. This is a bit more precise than our definition of data itself, but be careful: a non-relational dataset can be made up of data that appears to be related, and that does not make it relational.

          To expand on the earlier one-liners on data relationships: whether data is relational or non-relational is determined by how it is stored.
           
          Relational methods like SQL databases are designed to avoid duplicating data: in a normalized database, a value is stored once in one table, and any other tables that need it reference the original by key instead of keeping their own copy. Relational data is typically also structured data, meaning that it follows a strict template and structure, making how the data is stored and manipulated very consistent.
           
          Non-relational methods like NoSQL databases are designed for unstructured or loosely structured data like free-form text and images. There is typically no built-in deduplication, so if something exists in one collection (the rough equivalent of a table), other collections may hold their own copy of that data in a potentially different format.
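The difference is easier to see side by side. Below is a small sketch using Python's built-in sqlite3 module for the relational half and a plain list of dictionaries standing in for a document-style store. The table names, fields, and email addresses are all invented for the example, and a real NoSQL database would have its own API.

```python
import sqlite3

# Relational: the customer's email lives in one row; orders point at it by id.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        item TEXT
    );
    INSERT INTO customers VALUES (1, 'old@example.com');
    INSERT INTO orders VALUES (10, 1, 'router'), (11, 1, 'switch');
""")
db.execute("UPDATE customers SET email = 'new@example.com' WHERE id = 1")
relational_emails = [row[0] for row in db.execute(
    "SELECT c.email FROM orders o JOIN customers c ON c.id = o.customer_id ORDER BY o.id"
)]

# Document-style: each order carries its own copy of the email.
documents = [
    {"order": 10, "item": "router", "customer": {"email": "old@example.com"}},
    {"order": 11, "item": "switch", "customer": {"email": "old@example.com"}},
]
documents[0]["customer"]["email"] = "new@example.com"  # only this copy changes
document_emails = [doc["customer"]["email"] for doc in documents]

print(relational_emails)  # ['new@example.com', 'new@example.com']
print(document_emails)    # ['new@example.com', 'old@example.com']
```

One update fixes every order in the relational version because there is only one copy to fix. In the document version, the second order keeps the stale address until something updates it too.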

          Example A

          I tend to think of data relationships and structures like a network router or firewall configuration.

          In most routers I’ve worked with, to configure a DHCP server on a VLAN, you somehow get into a global configuration mode and then navigate through nested areas (ex: system -> interfaces -> DHCP -> servers -> pool) until you can configure a DHCP scope.

          The path I take to configure the static IP might be very different (ip -> interfaces -> switchport -> mode -> static).

          Even though we think of the DHCP server as inside the VLAN, it’s a completely different area in configuration. Now, I have to build a route for that VLAN, work with an ACL, and potentially assign it to a VRF, and, and, and, etc. 

          That would be an example of non-relational data because I’m configuring my data in multiple places, and that information is not updated elsewhere.

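          Now, in a newer router or SDN that probably has a fancy GUI, I can configure a VPN, and information that I change on the VPN is updated elsewhere (routing, policy, tunnel interface, name, etc.). That behaves more like relational data, because a change made in one place is reflected everywhere that references it.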

          Example B

          You have three spreadsheets:
          • Spreadsheet A, which has a list of DIY house projects you have done and plan to do and a unique identifier for each project
          • Spreadsheet B, which has a list of friends that have helped and/or plan to help and a unique identifier for each friend
          • Spreadsheet C, which has a list of unique identifiers for completed/uncompleted projects and the unique identifiers of friends who helped/planned to help with each project
          In the database world (imagine the horror), spreadsheet A and spreadsheet B are called reference tables, which make data consistent and quickly usable. Spreadsheet C would be called a junction table, as it ties the reference tables together and represents the relationship between spreadsheets A and B.
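If you want to see the three spreadsheets as actual tables, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names, projects, and friends are placeholders I invented; only the A/B/C layout comes from the example above.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE projects (project_id INTEGER PRIMARY KEY, name TEXT);   -- spreadsheet A
    CREATE TABLE friends  (friend_id  INTEGER PRIMARY KEY, name TEXT);   -- spreadsheet B
    CREATE TABLE project_helpers (                                       -- spreadsheet C
        project_id INTEGER REFERENCES projects(project_id),
        friend_id  INTEGER REFERENCES friends(friend_id),
        PRIMARY KEY (project_id, friend_id)
    );
    INSERT INTO projects VALUES (1, 'Build deck'), (2, 'Paint fence');
    INSERT INTO friends  VALUES (1, 'Alex'), (2, 'Sam');
    INSERT INTO project_helpers VALUES (1, 1), (1, 2), (2, 2);
""")

# The junction table only stores identifiers; the names come from the reference tables.
rows = db.execute("""
    SELECT p.name, f.name
    FROM project_helpers AS ph
    JOIN projects AS p ON p.project_id = ph.project_id
    JOIN friends  AS f ON f.friend_id  = ph.friend_id
    ORDER BY p.project_id, f.friend_id
""").fetchall()

for project, friend in rows:
    print(f"{project}: {friend}")
```

Renaming a friend in the friends table would change every line that mentions them, because the junction table never stored the name in the first place.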

          You could add complexity by introducing a project status (completed, on hold, planned). We will discuss databases more in-depth in a later article; this information is just to give additional context to the example.  

          Storing Data

          With the advent of cloud and hybrid-cloud environments, we will see many more tasks that require on-prem data to be stored in a public cloud. There are many different methods you can use here, and there is a lot of background on data storage.
          Without getting too into the weeds, we will focus on two dominant methods:  
          • Object Storage (S3, Azure Blob, etc.)
          • Block Storage (SAN, local disks formatted with NTFS, etc.)

          Object Storage

          Object Storage stores data objects as globally unique, independent objects inside an abstracted storage pool (usually) called a bucket or container.
           

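          Whatever you upload, typically unstructured data, is stored as a data object. Every file becomes its own data object, even if you upload a folder into a bucket: the folder no longer exists as a container, and its files appear in the bucket as individual objects (the folder path usually survives only as a prefix in each object's name).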

          To keep things scalable, data objects are assigned tags and metadata so they can be accessed easily. This is typically cloud-native, but it does not have to be. In the next article, we will host an S3 bucket locally.  
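As a mental model only, here is a toy in-memory "bucket" in Python. The ToyBucket class, its methods, and the file names are all hypothetical and are not the API of S3, Azure Blob, or any real service. It just shows the flat layout: every object sits at the same level, with an identifier, its bytes, and some metadata.

```python
import uuid

class ToyBucket:
    """In-memory stand-in for a bucket: a flat mapping of key -> object."""

    def __init__(self):
        self._objects = {}

    def put(self, key, body, metadata=None):
        self._objects[key] = {
            "id": str(uuid.uuid4()),       # unique identifier for this object
            "body": body,                  # the raw bytes
            "metadata": metadata or {},    # descriptive tags stored alongside
        }

    def list_keys(self, prefix=""):
        return sorted(k for k in self._objects if k.startswith(prefix))

    def get(self, key):
        return self._objects[key]

bucket = ToyBucket()
# "Uploading a folder" really means uploading each file under a shared key prefix.
bucket.put("vacation/beach.jpg", b"fake image bytes", {"type": "photo"})
bucket.put("vacation/notes.txt", b"packing list", {"type": "document"})
bucket.put("resume.docx", b"fake document bytes", {"type": "document"})

print(bucket.list_keys())                     # every object, all at one level
print(bucket.list_keys(prefix="vacation/"))   # the "folder" is just a name filter
print(bucket.get("vacation/notes.txt")["metadata"])
```

Notice there is no folder object anywhere in the bucket; "vacation/" only exists as the start of two object names.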

          Block Storage

          Block Storage stores raw data as uniquely addressable, fixed-size blocks on the storage medium.
           
          All data is divided into blocks of the configured size (usually 4 KB), and those blocks are then assigned a unique address and indexed by the storage controller before being written to some disk (USB, SSD, vSAN, etc.).
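The splitting step can be sketched in a few lines of Python. This is a simplification under my own assumptions: the function names are made up, the "addresses" are just sequential numbers, and the last block is padded with zero bytes. A real storage controller does far more than this.

```python
BLOCK_SIZE = 4096  # 4 KB

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Return {block_address: block_bytes}, zero-padding the final block."""
    blocks = {}
    for address, offset in enumerate(range(0, len(data), block_size)):
        chunk = data[offset:offset + block_size]
        blocks[address] = chunk.ljust(block_size, b"\x00")
    return blocks

def read_back(blocks, original_length):
    """Reassemble the blocks in address order and drop the padding."""
    joined = b"".join(blocks[address] for address in sorted(blocks))
    return joined[:original_length]

payload = b"x" * 10_000            # 10,000 bytes of made-up data
blocks = split_into_blocks(payload)

print(len(blocks))                 # 3 (two full blocks plus one padded block)
print(sorted(blocks))              # [0, 1, 2]
print(read_back(blocks, len(payload)) == payload)  # True
```

The blocks themselves know nothing about files or names. Something above them (a file system, for example) has to remember which addresses belong together and how long the original data was.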
           
          Block storage is a very precise and fast method of storing data, but that does make it less scalable and not suited for big data, unlike object storage. This is typically on-prem physical hardware but it does not have to be. We will host a block storage device in the cloud in the next article.  

          Defining Data Summary

          Data science is a massive field in the IT world and most of us normal people will not unlock all of its mysteries. You can call a lot of different things ‘data’ but to make things easier you can think of it in terms of qualitative or quantitative in nature.
           
          Most of us work with unstructured data like Word docs and music, but at some point you may work with structured data like a relational database or a financial spreadsheet.
           
          Typically we are using block storage in our day-to-day lives working on laptops and phones, but when we use a cloud service like iCloud or OneDrive, we are most likely accessing object storage behind the scenes.  
          Most importantly, data is pronounced day-tuh (long a – like Danger) NOT dat-uh (short a – like cat). Do not listen to anyone who pronounces it incorrectly.