Introduction

Project Goal

This project aims to implement a multi-stage cleaning workflow that transforms a messy, unstructured dataset by performing fuzzy string matching, null-value imputation, and data structure specification.

          Animal type  Country  Weight kg  Body Length cm          Gender  \
0                 NaN      NaN        NaN             NaN             NaN   
1                 NaN      NaN        NaN             NaN             NaN   
2      European bison   Poland    930.000           335.0            male   
3      European bison   Poland    909.000           311.0  not determined   
4     European bisonâ„¢   Poland    581.000           277.0          female   
...               ...      ...        ...             ...             ...   
1006     red squirrel   Poland      0.346            20.0          female   
1007         hedgehog  Germany      1.000            23.0          female   
1008         hedgehog  Germany      0.500            17.0          female   
1009     red squirrel   Poland      0.346            20.0          female   
1010         hedgehog  Hungary      0.900            16.0          female   

      Animal code   Latitude  Longitude Animal name Observation date  \
0             NaN        NaN        NaN         NaN       03.01.2024   
1             NaN        NaN        NaN         NaN       03.02.2024   
2             NaN  52.828845  23.820144       Szefu       01.03.2024   
3             NaN  52.830509  23.826849         NaN       01.03.2024   
4             NaN  52.834109  23.807093         NaN       01.03.2024   
...           ...        ...        ...         ...              ...   
1006          NaN  52.212001  21.033187        Lola       7 May 2024   
1007          NaN  49.561356  11.105334         NaN       7 May 2024   
1008          NaN  49.561569  11.087046         NaN       7 May 2024   
1009          NaN  52.212001  21.033187        Lola       7 May 2024   
1010          NaN  47.509860  18.939943         NaN       8 May 2024   

     Data compiled by  
0       James Johnson  
1       James Johnson  
2        Anne Anthony  
3        Anne Anthony  
4        Anne Anthony  
...               ...  
1006     Anne Anthony  
1007       Bob Bobson  
1008       Bob Bobson  
1009     Anne Anthony  
1010     Anne Anthony  

[1011 rows x 11 columns]

==============================================

Data Types:
 Animal type          object
Country              object
Weight kg           float64
Body Length cm      float64
Gender               object
Animal code         float64
Latitude            float64
Longitude           float64
Animal name          object
Observation date     object
Data compiled by     object
dtype: object

==============================================

Nulls:
 Animal type           20
Country               12
Weight kg             27
Body Length cm        27
Gender                19
Animal code         1011
Latitude              98
Longitude             98
Animal name          959
Observation date       0
Data compiled by       0
dtype: int64

==============================================

Duplicates:
 167

==============================================

         Weight kg  Body Length cm  Animal code    Latitude   Longitude
count   984.000000      984.000000          0.0  913.000000  913.000000
mean     39.745503       39.107724          NaN   49.393369   18.203280
std     156.290076       58.628601          NaN    7.168900    3.899601
min      -0.252000      -19.000000          NaN  -78.582973   11.074008
25%       0.293000       19.000000          NaN   48.186913   14.384559
50%       0.331500       21.000000          NaN   49.560723   18.944015
75%       0.800000       23.000000          NaN   52.212433   21.033243
max    1100.000000      350.000000          NaN   52.853843   34.896734
BEFORE: 

animal_type :  [nan 'European bison' 'European bisonâ„¢' 'European bisson'
 'European buster' 'lynx' 'lynx?' 'red squirel' 'red squirrel'
 'red squirrell' 'hedgehog' 'wedgehod' 'ledgehod']
country :  [nan 'Poland' 'Hungary' 'PL' 'Germany' 'Slovakia' 'Czech Republic' 'Czech'
 'Hungry' 'HU' 'Austria' 'Australia' 'CZ' 'DE' 'CC']
gender :  [nan 'male' 'not determined' 'female']
animal_name :  [nan 'Szefu' 'Basia' 'Lolek' 'Lola' 'Bob Bobson' 'Klucha' 'Zuzia' 'Bibi'
 'Sissi' 'Puchatek']
data_compiled_by :  ['James Johnson' 'Anne Anthony' 'Bob Bobson' 'John Johnson']

==============================================

Austria
     animal_type  country  weight_kg  body_length_cm  gender   latitude  \
164         NaN  Austria      0.263            22.0    male        NaN   
296    hedgehog  Austria      0.700            19.0  female  47.781139   
297    hedgehog  Austria      0.800            20.0    male  47.781828   
298    hedgehog  Austria      0.600            14.0  female  47.781060   
299    hedgehog  Austria      0.700            24.0    male  47.780872   
..          ...      ...        ...             ...     ...        ...   
774    hedgehog  Austria      0.900            25.0    male  47.317974   
775    hedgehog  Austria      1.100            22.0    male  47.316562   
776    hedgehog  Austria      1.200            24.0    male  47.318191   
803        lynx  Austria     20.000            90.0    male  47.781188   
804        lynx  Austria     26.000            91.0  female  47.780999   

     longitude animal_name observation_date data_compiled_by  
164        NaN         NaN       05.03.2024       Bob Bobson  
296  14.382004         NaN       14.03.2024       Bob Bobson  
297  14.385001         NaN       14.03.2024       Bob Bobson  
298  14.386807         NaN       14.03.2024       Bob Bobson  
299  14.384460         NaN       14.03.2024       Bob Bobson  
..         ...         ...              ...              ...  
774  11.187472         NaN       19.04.2024       Bob Bobson  
775  11.191160         NaN       19.04.2024       Bob Bobson  
776  11.189484         NaN       19.04.2024       Bob Bobson  
803  14.385356         NaN       24.04.2024       Bob Bobson  
804  14.382120         NaN       24.04.2024       Bob Bobson  

[74 rows x 10 columns]
Australia
       animal_type    country  weight_kg  body_length_cm          gender  \
295      hedgehog  Australia      0.800            24.0  not determined   
370      hedgehog  Australia      0.600            14.0          female   
583  red squirrel  Australia      0.266            20.0          female   
688  red squirrel  Australia      0.310            23.0            male   

      latitude  longitude animal_name observation_date data_compiled_by  
295  47.781689  14.386777         NaN       14.03.2024       Bob Bobson  
370  47.781248  14.382649         NaN       21.03.2024       Bob Bobson  
583  47.781632  14.382365         NaN       05.04.2024       Bob Bobson  
688  47.856092  15.083547         NaN       11.04.2024       Bob Bobson  

==============================================

AFTER: 

animal_type :  [nan 'European bison' 'lynx' 'red squirrel' 'hedgehog']
country :  [nan 'Poland' 'Hungary' 'Germany' 'Slovakia' 'Czech Republic' 'Austria']
gender :  [nan 'male' 'not determined' 'female']
animal_name :  [nan 'Szefu' 'Basia' 'Lolek' 'Lola' 'Bob Bobson' 'Klucha' 'Zuzia' 'Bibi'
 'Sissi' 'Puchatek']
data_compiled_by :  ['James Johnson' 'Anne Anthony' 'Bob Bobson' 'John Johnson']

Decisions Taken

  1. Made an assumption that all bisons within the dataset were instances of 'European bison', given the lack of acknowledgement of the American bison.
  2. Making an assumption of 'Austria' or 'Australia' needing to be mapped is an instance of data corruption. After investigating instances of both countries, I found out that the Australia data entries contained the coordinates of Austria. This led me to make the decision to replace 'Australia' with 'Austria'.
  3. There was no need to map Slovakia as it had no other representations.
  4. No other categorical datatypes were mapped given that they were either genders (which had no anomolies) or names.
Negative Weight Instances:  5
Negative Length Instances:  6
Negative Weight and Length Instances:  0

==============================================

[ nan  53.  49.  52.  48.  51.  50.  -2.  47. -79.]

Decisions Taken

  1. My first assumption was that both weight and height of the same entry might be negative due to a possible measurement system error. After verifying, these results were turned into NaN to be handled during null imputation.
  2. I ensured that the anomolous latitude values (-2, -79) were not close to any other values in order to confirm anomolous entries. These results were turned into NaN to be handled during null imputation.
Latitude:
                       min        max
country                             
Austria         47.316383  47.857586
Czech Republic  49.937340  50.884288
Germany         48.160432  52.588834
Hungary         47.505768  48.511532
Poland          50.041119  52.853843
Slovakia        48.183848  49.057758

Longitude:
                       min        max
country                             
Austria         11.184790  15.083547
Czech Republic  14.177877  14.850140
Germany         11.074008  13.152811
Hungary         18.939062  34.896734
Poland          19.949320  23.919668
Slovakia        17.065508  22.425527


Hungary:
       animal_type  country  weight_kg  body_length_cm gender  latitude  \
202  red squirrel  Hungary      0.296            19.0   male       NaN   
213  red squirrel  Hungary      0.296            19.0   male       NaN   
253  red squirrel  Hungary      0.282            21.0   male       NaN   
263  red squirrel  Hungary      0.282            21.0   male       NaN   

     longitude animal_name observation_date data_compiled_by  
202  34.896734         NaN       07.03.2024     Anne Anthony  
213  34.896734         NaN       07.03.2024     Anne Anthony  
253  34.896734         NaN       11.03.2024     Anne Anthony  
263  34.896734         NaN       11.03.2024     Anne Anthony  

==============================================

Weight:
                  min      max
animal_type                  
European bison  24.0  1100.00
hedgehog         0.4     1.20
lynx            11.0   171.00
red squirrel     0.2     0.36

Length:
                   min    max
animal_type                 
European bison  100.0  350.0
hedgehog         11.0  151.0
lynx             52.0  131.0
red squirrel     14.0   24.0


Lynx Weight:
      weight_kg  body_length_cm
364      171.0            75.0

Hedgehog Length:
      weight_kg  body_length_cm
430        0.9           151.0

Decisions Taken

  1. First point of investigation was the min and max values of the latitude and longitude of each country to ensure there were no extreme outliers. This led to 2 outliers of 'longitude' values in Hungary. Therefore, these instances were converted to NaN to be imputed with the mean value in the null imputation section.
  2. Second point of investigation was the min and max values of the weights and lengths of each animal to ensure there were no extreme outliers. For this section, research had to be performed to ensure educated outlier identification. This led to a 'weight_kg' outlier for the lynx claiming a weight of 171kg, when on average the largest lynx grows to 30kg, as well as a 'body_length_cm' outlier for the hedgehog claiming a length of 151cm, when on average the hedgehog grows to about 35cm at most. Therefore, these instances were converted to NaN to be imputed with the median value in the null imputation section.
  3. Although the European bison had an outlier of 1100kg, this is right within the range of what is naturally possible, hence why it was left.
NaT:  43

Decision Taken

  • Objects which did not fit the datetime format where stored as NaT (Not a Time). These were kept within the dataset in order to show the invalid date without disrupting the datetime data type of 'observation_date'.
Nulls:
 animal_type          0
country              0
weight_kg            0
body_length_cm       0
gender               0
latitude             0
longitude            0
animal_name          0
observation_date    39
data_compiled_by     0
dtype: int64

Decisions Taken

  1. 'animal_type' is the primary feature, therefore any record with a null primary feature was dropped.
  2. Any record missing 'country', 'latitude', AND 'longitude' was dropped as there is no way to pinpoint one without the other.
  3. Fields 'gender' and 'animal_name' were filled with values indicating a lack of input.
  4. Field 'observation_date' was filled with NaT value as to indicate invalid date while preserving datetime object for potential computation.
  5. Used mean average for latitude and longitude due to the low variance between the coordinates of a country.
  6. Used median average for weight and length due to the possibility of outliers skewing mean data.
Duplicates:
 0

Decision Taken

  • The probability of an animal of the same type, weight, and body length being found at the exact (to the 6d.p.) latitude and longitude on the same day is practically 0. Therefore, any duplicates with such characteristics were dropped.
        animal_type  country  weight_kg  body_length_cm          gender  \
0    European bison   Poland    930.000           335.0            male   
1    European bison   Poland    909.000           311.0  not determined   
2    European bison   Poland    581.000           277.0          female   
3    European bison   Poland    900.000           295.0            male   
4    European bison   Poland    620.000           250.0          female   
..              ...      ...        ...             ...             ...   
818        hedgehog  Hungary      0.900            23.0            male   
819    red squirrel   Poland      0.346            20.0          female   
820        hedgehog  Germany      1.000            23.0          female   
821        hedgehog  Germany      0.500            17.0          female   
822        hedgehog  Hungary      0.900            16.0          female   

      latitude  longitude animal_name observation_date data_compiled_by  
0    52.828845  23.820144       Szefu       2024-03-01     Anne Anthony  
1    52.830509  23.826849     unnamed       2024-03-01     Anne Anthony  
2    52.834109  23.807093     unnamed       2024-03-01     Anne Anthony  
3    52.834759  23.817201     unnamed       2024-03-01     Anne Anthony  
4    52.834960  23.820750     unnamed       2024-03-01     Anne Anthony  
..         ...        ...         ...              ...              ...  
818  47.510055  18.944356     unnamed              NaT     Anne Anthony  
819  52.212001  21.033187        Lola              NaT     Anne Anthony  
820  49.561356  11.105334     unnamed              NaT       Bob Bobson  
821  49.561569  11.087046     unnamed              NaT       Bob Bobson  
822  47.509860  18.939943     unnamed              NaT     Anne Anthony  

[823 rows x 10 columns]

==============================================

Data Types:
 animal_type                 object
country                     object
weight_kg                  float64
body_length_cm             float64
gender                      object
latitude                   float64
longitude                  float64
animal_name                 object
observation_date    datetime64[ns]
data_compiled_by            object
dtype: object

==============================================

Nulls:
 animal_type          0
country              0
weight_kg            0
body_length_cm       0
gender               0
latitude             0
longitude            0
animal_name          0
observation_date    32
data_compiled_by     0
dtype: int64

==============================================

Duplicates:
 0

==============================================

         weight_kg  body_length_cm    latitude   longitude
count   823.000000      823.000000  823.000000  823.000000
mean     47.285552       42.771567   49.750522   17.120718
std     169.788872       63.212778    1.772724    4.045733
min       0.200000       11.000000   47.316383   11.074008
25%       0.299000       19.000000   48.226694   14.179944
50%       0.400000       21.000000   49.561356   17.089464
75%       1.000000       23.000000   50.881402   20.562948
max    1100.000000      350.000000   52.853843   23.919668

Data Cleaning Result Analysis

  1. Redundant column has been removed.
  2. Consistent naming conventions.
  3. Consistent datetime formatting with use of datetime object.
  4. No nulls except for the expected NaT nulls which were kept on purpose.
  5. No duplicates.
  6. Significant improvement in dataset statistics.
No description has been provided for this image

Visualisation Analysis

  1. The smaller the animal, the lower the variance between its min and max points. Hence why the data is seen as columns for the smaller animals, but spreads apart as the animal gets larger. This is caused by the 'body_length_cm' entries being rounded to 0 d.p., a point of improvement for any future data taking.
  2. All outliers which can be seen are within natural limits. Grand outliers are likely big specimen of that species, while small outliers are likely babies.

Final Conclusions

  1. Data Volume: 823 validated rows (18.6% reduction from the raw dataset of 1011 rows)
  2. Coordinate Constraint Validation: Achieved a 100% accuracy by eliminating instances of longitudinal drift.
  3. Schema Standardisation: Data fields such as 'animal_type' were not cast to 'category' to ensure no memory overhead is caused by the implementation of new animals.
  4. Precision Analysis: Identified a quantisation bias caused by the 'body_length_cm' being rounded to the nearest cm (0 d.p.) - negligable for large species, causes vertical stacking in smaller species.
  5. Temporal Handling: 32 missing timestamps (3.9% of timestamps) were preserved as NaT (Not a Time) to preserve datatype while not imputing fake values.