Many of my university students ask me: "Hey prof, what kind of work does a data scientist actually do?" Well, it's not as exciting as it might seem to someone who associates it strictly with artificial intelligence.
I tell them that, in general, you start with a question about a certain topic, such as:
Is there a correlation between the frequency of cycling and the temperature?
That's what this first post is about: it will put you in the shoes of a data scientist.
Step One: Find the Data
It's best to search on official sites that give us confidence in the data we acquire. In this case we took the data from the official open data site of the city of Buenos Aires (https://data.buenosaires.gob.ar/dataset/bicicletas-publicas); we will work with the year 2018.
To obtain the daily temperatures we took the data from meteostat.net (https://meteostat.net/es/station/87585?t=2018-01-01/2018-12-31). In both cases we ended up with files in CSV format.
Once the data was downloaded, I uploaded it to my Google Drive and started the analysis on the Google Colab platform (colab.research.google.com), which lets us work with Jupyter notebooks (jupyter.org) and comes with a large number of data-science packages preloaded.
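As a side note, if you'd rather script the download than click through the portal, pandas can read a CSV straight from a URL. A minimal sketch (the URL below is a placeholder, not the real link; grab the actual CSV address from the dataset page):
import pandas as pd
# hypothetical direct link: replace with the real CSV URL from the dataset page
CSV_URL = "https://example.com/BA-bikes-rides-2018.csv"
bike_ride_2018 = pd.read_csv(CSV_URL)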
Step Two: Prepare the Data (clean, format, sort, filter, join, etc.)
Let's start with the Python code. Obviously we will use the Pandas library (pandas.pydata.org) to create and manage our dataframes.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
%matplotlib inline
# google drive stuff
from google.colab import drive
drive.mount('/gdrive')
%cd /gdrive/MyDrive
temp2018 = pd.read_csv('./BA-temp-2018.csv')
bike_ride_2018 = pd.read_csv('./BA-bikes-rides-2018.csv')
# inspect the data
bike_ride_2018.head()
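# optional extra checks (my addition): info() summarizes dtypes and
# non-null counts, handy for spotting missing values before we clean
bike_ride_2018.info()
temp2018.head()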
# reset the index to date in both dataframes, later join them
temp2018['date2'] = pd.to_datetime(temp2018['date']).dt.date
temp2018.set_index('date2', inplace=True)
bike_ride_2018.rename(columns={'fecha_origen_recorrido':'dateTime'}, inplace=True)
bike_ride_2018['date'] = pd.to_datetime(bike_ride_2018['dateTime']).dt.date
bike_ride_2018.set_index('date', inplace=True)
OK, so far we have imported the packages we will use, mounted our Google Drive, and retrieved both datasets. We inspected the data and its types, and created a date index in both dataframes so we can later join them. What we are after is a dataframe that has, for each day of the year, the average temperature and the total duration of the trips our cyclists made. That way we can graph both variables and see whether there is some kind of relationship. So let's keep coding.
# put -1 in NaN values
bike_ride_2018 = bike_ride_2018.fillna(-1)
# create a new dataframe (a copy, to avoid SettingWithCopyWarning)
# with date index and only a column for travel duration
bikeRides = bike_ride_2018[['duracion_recorrido']].copy()
bikeRides['duration'] = pd.to_timedelta(bikeRides['duracion_recorrido'])
bikeRides.drop(columns=['duracion_recorrido'], inplace=True)
# a new dataframe grouping the individual ride durations by date
bikeRidesDay = bikeRides.groupby(['date']).sum()
bikeRidesDay
So far we have the daily sum of the time our cyclists spent riding, but since it is a timedelta, it is expressed in days, hours, and so on.
For our purpose of graphing it, we need a single numeric measure of time: hours.
# create a new float column 'hs_day' with the total duration in hours
# (total_seconds() already accounts for the days component)
bikeRidesDay['hs_day'] = bikeRidesDay['duration'].dt.total_seconds() / 3600
# join both dataframes on their date index,
# keeping only the average temperature column from temp2018
df = bikeRidesDay.join(temp2018[['tavg']])
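# a quick sanity-check sketch (optional): drop any day missing either value,
# then eyeball the joined dataframe
df = df.dropna(subset=['hs_day', 'tavg'])
df.head()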
Step Three: Graph the Results
Finally, we will use a graphing tool to visualize our data and help us draw conclusions. For this we will use the seaborn package (seaborn.pydata.org).
# Graph results
g = sns.lmplot(x='tavg', y='hs_day', data=df)
g = g.set_axis_labels("Daily avg. temperature (°C)", "Daily bike ride hours")
ax = plt.gca()
ax.set_title("Bike rides/Avg.Temperature correlation in Buenos Aires, 2018")
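The plot gives a visual impression; to put a number on the relationship we can also compute Pearson's correlation coefficient directly with pandas (a minimal sketch; the exact value will depend on the data you downloaded):
# Pearson's r between daily average temperature and total riding hours
r = df['tavg'].corr(df['hs_day'])
print(f"Pearson r = {r:.2f}")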
Conclusions
In the graph we can see that there is a relationship between the two features (daily average temperature and daily hours of bike rides). Although there is a lot of dispersion, it can be seen that as the temperature rises, the number of hours falls. For example, above 23 °C the daily total sits in the 1500-3000 hour range, while below 15 °C we find a wider range (3000-4500 hours).
To download the Jupyter notebook: https://github.com/jrercoli/gs_ds_bike_ride
OK folks, in future posts we will keep analyzing different features of our bike ride dataframe, using other analysis and visualization libraries applied to data science. Your comments are appreciated!