Introduction
Project Goal
This project aims to implement a multi-stage cleaning workflow that transforms a messy, unstructured dataset by performing fuzzy string matching, null-value imputation, and data structure specification.
Animal type Country Weight kg Body Length cm Gender \
0 NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN
2 European bison Poland 930.000 335.0 male
3 European bison Poland 909.000 311.0 not determined
4 European bisonâ„¢ Poland 581.000 277.0 female
... ... ... ... ... ...
1006 red squirrel Poland 0.346 20.0 female
1007 hedgehog Germany 1.000 23.0 female
1008 hedgehog Germany 0.500 17.0 female
1009 red squirrel Poland 0.346 20.0 female
1010 hedgehog Hungary 0.900 16.0 female
Animal code Latitude Longitude Animal name Observation date \
0 NaN NaN NaN NaN 03.01.2024
1 NaN NaN NaN NaN 03.02.2024
2 NaN 52.828845 23.820144 Szefu 01.03.2024
3 NaN 52.830509 23.826849 NaN 01.03.2024
4 NaN 52.834109 23.807093 NaN 01.03.2024
... ... ... ... ... ...
1006 NaN 52.212001 21.033187 Lola 7 May 2024
1007 NaN 49.561356 11.105334 NaN 7 May 2024
1008 NaN 49.561569 11.087046 NaN 7 May 2024
1009 NaN 52.212001 21.033187 Lola 7 May 2024
1010 NaN 47.509860 18.939943 NaN 8 May 2024
Data compiled by
0 James Johnson
1 James Johnson
2 Anne Anthony
3 Anne Anthony
4 Anne Anthony
... ...
1006 Anne Anthony
1007 Bob Bobson
1008 Bob Bobson
1009 Anne Anthony
1010 Anne Anthony
[1011 rows x 11 columns]
==============================================
Data Types:
Animal type object
Country object
Weight kg float64
Body Length cm float64
Gender object
Animal code float64
Latitude float64
Longitude float64
Animal name object
Observation date object
Data compiled by object
dtype: object
==============================================
Nulls:
Animal type 20
Country 12
Weight kg 27
Body Length cm 27
Gender 19
Animal code 1011
Latitude 98
Longitude 98
Animal name 959
Observation date 0
Data compiled by 0
dtype: int64
==============================================
Duplicates:
167
==============================================
Weight kg Body Length cm Animal code Latitude Longitude
count 984.000000 984.000000 0.0 913.000000 913.000000
mean 39.745503 39.107724 NaN 49.393369 18.203280
std 156.290076 58.628601 NaN 7.168900 3.899601
min -0.252000 -19.000000 NaN -78.582973 11.074008
25% 0.293000 19.000000 NaN 48.186913 14.384559
50% 0.331500 21.000000 NaN 49.560723 18.944015
75% 0.800000 23.000000 NaN 52.212433 21.033243
max 1100.000000 350.000000 NaN 52.853843 34.896734
BEFORE:
animal_type : [nan 'European bison' 'European bisonâ„¢' 'European bisson'
'European buster' 'lynx' 'lynx?' 'red squirel' 'red squirrel'
'red squirrell' 'hedgehog' 'wedgehod' 'ledgehod']
country : [nan 'Poland' 'Hungary' 'PL' 'Germany' 'Slovakia' 'Czech Republic' 'Czech'
'Hungry' 'HU' 'Austria' 'Australia' 'CZ' 'DE' 'CC']
gender : [nan 'male' 'not determined' 'female']
animal_name : [nan 'Szefu' 'Basia' 'Lolek' 'Lola' 'Bob Bobson' 'Klucha' 'Zuzia' 'Bibi'
'Sissi' 'Puchatek']
data_compiled_by : ['James Johnson' 'Anne Anthony' 'Bob Bobson' 'John Johnson']
==============================================
Austria
animal_type country weight_kg body_length_cm gender latitude \
164 NaN Austria 0.263 22.0 male NaN
296 hedgehog Austria 0.700 19.0 female 47.781139
297 hedgehog Austria 0.800 20.0 male 47.781828
298 hedgehog Austria 0.600 14.0 female 47.781060
299 hedgehog Austria 0.700 24.0 male 47.780872
.. ... ... ... ... ... ...
774 hedgehog Austria 0.900 25.0 male 47.317974
775 hedgehog Austria 1.100 22.0 male 47.316562
776 hedgehog Austria 1.200 24.0 male 47.318191
803 lynx Austria 20.000 90.0 male 47.781188
804 lynx Austria 26.000 91.0 female 47.780999
longitude animal_name observation_date data_compiled_by
164 NaN NaN 05.03.2024 Bob Bobson
296 14.382004 NaN 14.03.2024 Bob Bobson
297 14.385001 NaN 14.03.2024 Bob Bobson
298 14.386807 NaN 14.03.2024 Bob Bobson
299 14.384460 NaN 14.03.2024 Bob Bobson
.. ... ... ... ...
774 11.187472 NaN 19.04.2024 Bob Bobson
775 11.191160 NaN 19.04.2024 Bob Bobson
776 11.189484 NaN 19.04.2024 Bob Bobson
803 14.385356 NaN 24.04.2024 Bob Bobson
804 14.382120 NaN 24.04.2024 Bob Bobson
[74 rows x 10 columns]
Australia
animal_type country weight_kg body_length_cm gender \
295 hedgehog Australia 0.800 24.0 not determined
370 hedgehog Australia 0.600 14.0 female
583 red squirrel Australia 0.266 20.0 female
688 red squirrel Australia 0.310 23.0 male
latitude longitude animal_name observation_date data_compiled_by
295 47.781689 14.386777 NaN 14.03.2024 Bob Bobson
370 47.781248 14.382649 NaN 21.03.2024 Bob Bobson
583 47.781632 14.382365 NaN 05.04.2024 Bob Bobson
688 47.856092 15.083547 NaN 11.04.2024 Bob Bobson
==============================================
AFTER:
animal_type : [nan 'European bison' 'lynx' 'red squirrel' 'hedgehog']
country : [nan 'Poland' 'Hungary' 'Germany' 'Slovakia' 'Czech Republic' 'Austria']
gender : [nan 'male' 'not determined' 'female']
animal_name : [nan 'Szefu' 'Basia' 'Lolek' 'Lola' 'Bob Bobson' 'Klucha' 'Zuzia' 'Bibi'
'Sissi' 'Puchatek']
data_compiled_by : ['James Johnson' 'Anne Anthony' 'Bob Bobson' 'John Johnson']
Decisions Taken
- Made an assumption that all bisons within the dataset were instances of 'European bison', given the lack of acknowledgement of the American bison.
- Making an assumption of 'Austria' or 'Australia' needing to be mapped is an instance of data corruption. After investigating instances of both countries, I found out that the Australia data entries contained the coordinates of Austria. This led me to make the decision to replace 'Australia' with 'Austria'.
- There was no need to map Slovakia as it had no other representations.
- No other categorical datatypes were mapped given that they were either genders (which had no anomolies) or names.
Negative Weight Instances: 5 Negative Length Instances: 6 Negative Weight and Length Instances: 0 ============================================== [ nan 53. 49. 52. 48. 51. 50. -2. 47. -79.]
Decisions Taken
- My first assumption was that both weight and height of the same entry might be negative due to a possible measurement system error. After verifying, these results were turned into NaN to be handled during null imputation.
- I ensured that the anomolous latitude values (-2, -79) were not close to any other values in order to confirm anomolous entries. These results were turned into NaN to be handled during null imputation.
Latitude:
min max
country
Austria 47.316383 47.857586
Czech Republic 49.937340 50.884288
Germany 48.160432 52.588834
Hungary 47.505768 48.511532
Poland 50.041119 52.853843
Slovakia 48.183848 49.057758
Longitude:
min max
country
Austria 11.184790 15.083547
Czech Republic 14.177877 14.850140
Germany 11.074008 13.152811
Hungary 18.939062 34.896734
Poland 19.949320 23.919668
Slovakia 17.065508 22.425527
Hungary:
animal_type country weight_kg body_length_cm gender latitude \
202 red squirrel Hungary 0.296 19.0 male NaN
213 red squirrel Hungary 0.296 19.0 male NaN
253 red squirrel Hungary 0.282 21.0 male NaN
263 red squirrel Hungary 0.282 21.0 male NaN
longitude animal_name observation_date data_compiled_by
202 34.896734 NaN 07.03.2024 Anne Anthony
213 34.896734 NaN 07.03.2024 Anne Anthony
253 34.896734 NaN 11.03.2024 Anne Anthony
263 34.896734 NaN 11.03.2024 Anne Anthony
==============================================
Weight:
min max
animal_type
European bison 24.0 1100.00
hedgehog 0.4 1.20
lynx 11.0 171.00
red squirrel 0.2 0.36
Length:
min max
animal_type
European bison 100.0 350.0
hedgehog 11.0 151.0
lynx 52.0 131.0
red squirrel 14.0 24.0
Lynx Weight:
weight_kg body_length_cm
364 171.0 75.0
Hedgehog Length:
weight_kg body_length_cm
430 0.9 151.0
Decisions Taken
- First point of investigation was the min and max values of the latitude and longitude of each country to ensure there were no extreme outliers. This led to 2 outliers of 'longitude' values in Hungary. Therefore, these instances were converted to NaN to be imputed with the mean value in the null imputation section.
- Second point of investigation was the min and max values of the weights and lengths of each animal to ensure there were no extreme outliers. For this section, research had to be performed to ensure educated outlier identification. This led to a 'weight_kg' outlier for the lynx claiming a weight of 171kg, when on average the largest lynx grows to 30kg, as well as a 'body_length_cm' outlier for the hedgehog claiming a length of 151cm, when on average the hedgehog grows to about 35cm at most. Therefore, these instances were converted to NaN to be imputed with the median value in the null imputation section.
- Although the European bison had an outlier of 1100kg, this is right within the range of what is naturally possible, hence why it was left.
NaT: 43
Decision Taken
- Objects which did not fit the datetime format where stored as NaT (Not a Time). These were kept within the dataset in order to show the invalid date without disrupting the datetime data type of 'observation_date'.
Nulls: animal_type 0 country 0 weight_kg 0 body_length_cm 0 gender 0 latitude 0 longitude 0 animal_name 0 observation_date 39 data_compiled_by 0 dtype: int64
Decisions Taken
- 'animal_type' is the primary feature, therefore any record with a null primary feature was dropped.
- Any record missing 'country', 'latitude', AND 'longitude' was dropped as there is no way to pinpoint one without the other.
- Fields 'gender' and 'animal_name' were filled with values indicating a lack of input.
- Field 'observation_date' was filled with NaT value as to indicate invalid date while preserving datetime object for potential computation.
- Used mean average for latitude and longitude due to the low variance between the coordinates of a country.
- Used median average for weight and length due to the possibility of outliers skewing mean data.
Duplicates: 0
Decision Taken
- The probability of an animal of the same type, weight, and body length being found at the exact (to the 6d.p.) latitude and longitude on the same day is practically 0. Therefore, any duplicates with such characteristics were dropped.
animal_type country weight_kg body_length_cm gender \
0 European bison Poland 930.000 335.0 male
1 European bison Poland 909.000 311.0 not determined
2 European bison Poland 581.000 277.0 female
3 European bison Poland 900.000 295.0 male
4 European bison Poland 620.000 250.0 female
.. ... ... ... ... ...
818 hedgehog Hungary 0.900 23.0 male
819 red squirrel Poland 0.346 20.0 female
820 hedgehog Germany 1.000 23.0 female
821 hedgehog Germany 0.500 17.0 female
822 hedgehog Hungary 0.900 16.0 female
latitude longitude animal_name observation_date data_compiled_by
0 52.828845 23.820144 Szefu 2024-03-01 Anne Anthony
1 52.830509 23.826849 unnamed 2024-03-01 Anne Anthony
2 52.834109 23.807093 unnamed 2024-03-01 Anne Anthony
3 52.834759 23.817201 unnamed 2024-03-01 Anne Anthony
4 52.834960 23.820750 unnamed 2024-03-01 Anne Anthony
.. ... ... ... ... ...
818 47.510055 18.944356 unnamed NaT Anne Anthony
819 52.212001 21.033187 Lola NaT Anne Anthony
820 49.561356 11.105334 unnamed NaT Bob Bobson
821 49.561569 11.087046 unnamed NaT Bob Bobson
822 47.509860 18.939943 unnamed NaT Anne Anthony
[823 rows x 10 columns]
==============================================
Data Types:
animal_type object
country object
weight_kg float64
body_length_cm float64
gender object
latitude float64
longitude float64
animal_name object
observation_date datetime64[ns]
data_compiled_by object
dtype: object
==============================================
Nulls:
animal_type 0
country 0
weight_kg 0
body_length_cm 0
gender 0
latitude 0
longitude 0
animal_name 0
observation_date 32
data_compiled_by 0
dtype: int64
==============================================
Duplicates:
0
==============================================
weight_kg body_length_cm latitude longitude
count 823.000000 823.000000 823.000000 823.000000
mean 47.285552 42.771567 49.750522 17.120718
std 169.788872 63.212778 1.772724 4.045733
min 0.200000 11.000000 47.316383 11.074008
25% 0.299000 19.000000 48.226694 14.179944
50% 0.400000 21.000000 49.561356 17.089464
75% 1.000000 23.000000 50.881402 20.562948
max 1100.000000 350.000000 52.853843 23.919668
Data Cleaning Result Analysis
- Redundant column has been removed.
- Consistent naming conventions.
- Consistent datetime formatting with use of datetime object.
- No nulls except for the expected NaT nulls which were kept on purpose.
- No duplicates.
- Significant improvement in dataset statistics.
Visualisation Analysis
- The smaller the animal, the lower the variance between its min and max points. Hence why the data is seen as columns for the smaller animals, but spreads apart as the animal gets larger. This is caused by the 'body_length_cm' entries being rounded to 0 d.p., a point of improvement for any future data taking.
- All outliers which can be seen are within natural limits. Grand outliers are likely big specimen of that species, while small outliers are likely babies.
Final Conclusions
- Data Volume: 823 validated rows (18.6% reduction from the raw dataset of 1011 rows)
- Coordinate Constraint Validation: Achieved a 100% accuracy by eliminating instances of longitudinal drift.
- Schema Standardisation: Data fields such as 'animal_type' were not cast to 'category' to ensure no memory overhead is caused by the implementation of new animals.
- Precision Analysis: Identified a quantisation bias caused by the 'body_length_cm' being rounded to the nearest cm (0 d.p.) - negligable for large species, causes vertical stacking in smaller species.
- Temporal Handling: 32 missing timestamps (3.9% of timestamps) were preserved as NaT (Not a Time) to preserve datatype while not imputing fake values.