Original poster | Posted on 2021-1-25 10:15:33
[code]Ex3 - Getting and Knowing your Data
Check out Occupation Exercises Video Tutorial to watch a data scientist go through the exercises
This time we are going to pull data directly from the internet. Special thanks to: https://github.com/justmarkham for sharing the dataset and materials.
Step 1. Import the necessary libraries
In [39]:
import pandas as pd
Step 2. Import the dataset from this address.
Step 3. Assign it to a variable called users and use the 'user_id' as index
In [40]:
users = pd.read_csv('https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user',
                    sep='|', index_col='user_id')
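The same call can be tried offline on a few rows copied from the output in this post. This is only a minimal sketch: it assumes the file starts with the header line user_id|age|gender|occupation|zip_code (inferred from the column names, the post does not show the raw file), and sample_text and sample are hypothetical names. The dtype argument is an addition of this sketch, not part of the original call; it keeps leading zeros in a small all-numeric sample.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for the remote file: an assumed header line
# plus three rows copied from the head() output in this post.
sample_text = """user_id|age|gender|occupation|zip_code
1|24|M|technician|85711
8|36|M|administrator|05201
9|29|M|student|01002
"""

# Same separator and index column as the real call.
# dtype={'zip_code': str} is added here so '05201' is not parsed as the integer 5201.
sample = pd.read_csv(io.StringIO(sample_text), sep='|', index_col='user_id',
                     dtype={'zip_code': str})
print(sample)
```

Because user_id becomes the index, it no longer appears among the regular columns.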
Step 4. See the first 25 entries
In [41]:
users.head(25)
Out[41]:
         age  gender     occupation  zip_code
user_id
1         24       M     technician     85711
2         53       F          other     94043
3         23       M         writer     32067
4         24       M     technician     43537
5         33       F          other     15213
6         42       M      executive     98101
7         57       M  administrator     91344
8         36       M  administrator     05201
9         29       M        student     01002
10        53       M         lawyer     90703
11        39       F          other     30329
12        28       F          other     06405
13        47       M       educator     29206
14        45       M      scientist     55106
15        49       F       educator     97301
16        21       M  entertainment     10309
17        30       M     programmer     06355
18        35       F          other     37212
19        40       M      librarian     02138
20        42       F      homemaker     95660
21        26       M         writer     30068
22        25       M         writer     40206
23        30       F         artist     48197
24        21       F         artist     94533
25        39       M       engineer     55107
Step 5. See the last 10 entries
In [42]:
users.tail(10)
Out[42]:
         age  gender     occupation  zip_code
user_id
934       61       M       engineer     22902
935       42       M         doctor     66221
936       24       M          other     32789
937       48       M       educator     98072
938       38       F     technician     55038
939       26       F        student     33319
940       32       M  administrator     02215
941       20       M        student     97229
942       48       F      librarian     78209
943       22       M        student     77841
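For reference, head and tail take the number of rows as an optional argument and default to 5 when it is omitted. A small sketch on a hypothetical 30-row frame (demo and its placeholder values are not the real data):

```python
import pandas as pd

# Hypothetical 30-row frame indexed 1..30; the values are placeholders.
demo = pd.DataFrame({'age': range(30)},
                    index=pd.Index(range(1, 31), name='user_id'))

first_default = demo.head()   # no argument: first 5 rows
first_25 = demo.head(25)      # first 25 rows, as in Step 4
last_10 = demo.tail(10)       # last 10 rows, as in Step 5

print(len(first_default), len(first_25), len(last_10))
print(last_10.index.min(), last_10.index.max())
```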
Step 6. What is the number of observations in the dataset?
In [43]:
users.shape[0]
Out[43]:
943
Step 7. What is the number of columns in the dataset?
In [44]:
users.shape[1]
Out[44]:
4
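Steps 6 and 7 read the two halves of the same (rows, columns) tuple. The sketch below unpacks it in one go on a hypothetical three-row frame (demo is an illustrative name). Note that the index is not counted as a column, which is why the post reports 4 columns even though the file has five fields per line.

```python
import pandas as pd

# Hypothetical three-row frame with user_id as the index.
demo = pd.DataFrame(
    {'age': [24, 53, 23], 'gender': ['M', 'F', 'M']},
    index=pd.Index([1, 2, 3], name='user_id'),
)

# shape is a plain tuple: (number of rows, number of columns).
n_rows, n_cols = demo.shape
print(n_rows, n_cols)

# Equivalent ways to get each number separately.
print(len(demo), len(demo.columns))
```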
Step 8. Print the name of all the columns.
In [45]:
users.columns
Out[45]:
Index(['age', 'gender', 'occupation', 'zip_code'], dtype='object')
Step 9. How is the dataset indexed?
In [46]:
# "the index" (aka "the labels")
users.index
Out[46]:
Int64Index([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
...
934, 935, 936, 937, 938, 939, 940, 941, 942, 943],
dtype='int64', name='user_id', length=943)
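A note on the output above: Int64Index was removed in pandas 2.0, so newer versions print a plain Index([...], dtype='int64', ...) for the same call. The sketch below inspects the index in a way that should not depend on the class name; demo is a hypothetical three-row frame, not the real data.

```python
import pandas as pd

# Hypothetical frame using three rows from the post, indexed by user_id.
demo = pd.DataFrame({'age': [24, 53, 23]},
                    index=pd.Index([1, 2, 3], name='user_id'))

idx = demo.index
print(idx.name, idx.dtype, len(idx))

# .loc looks rows up by index label (the user_id value) ...
age_of_user_2 = demo.loc[2, 'age']
# ... while .iloc looks rows up by position, starting at 0.
age_of_first_row = demo.iloc[0]['age']
print(age_of_user_2, age_of_first_row)
```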
Step 10. What is the data type of each column?
In [47]:
users.dtypes
Out[47]:
age            int64
gender        object
occupation    object
zip_code      object
dtype: object
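The object dtype on zip_code means the values are held as strings, which is what keeps codes such as 05201 intact. A short sketch of what would happen if they were converted to integers (zips, as_int and restored are hypothetical names; the three codes are copied from the output in this post):

```python
import pandas as pd

# Three zip codes from the post, stored as strings.
zips = pd.Series(['85711', '05201', '01002'], name='zip_code')

# Converting to integers silently drops the leading zeros.
as_int = zips.astype(int)
print(as_int.tolist())

# Padding back to five digits recovers the original text for these codes.
restored = as_int.astype(str).str.zfill(5)
print(restored.tolist())
```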
Step 11. Print only the occupation column
In [48]:
users.occupation
# or, equivalently, with bracket notation:
users['occupation']
Out[48]:
user_id
1      technician
2           other
3          writer
4      technician
5           other
6       executive
...    (output truncated by the poster because it did not fit; only the final row follows)
943       student
Name: occupation, Length: 943, dtype: object
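Both spellings in Step 11 return the same Series. A minimal sketch on a hypothetical three-row frame (demo is an illustrative name) that also shows the double-bracket form, which returns a one-column DataFrame instead:

```python
import pandas as pd

# Hypothetical frame with three rows copied from the post.
demo = pd.DataFrame(
    {'age': [24, 53, 23],
     'occupation': ['technician', 'other', 'writer']},
    index=pd.Index([1, 2, 3], name='user_id'),
)

by_attribute = demo.occupation        # attribute access
by_bracket = demo['occupation']       # bracket access, same Series
as_frame = demo[['occupation']]       # list of names: one-column DataFrame

print(type(by_bracket).__name__, type(as_frame).__name__)
```

Bracket access is the safer habit, since attribute access only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method.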