How do I get the data?

When all members of a team have signed the NDA, they will get instructions to set up the environment that provides JupyterHub with PySpark and Python3 kernels. On the day of the data release, the teams will have access to the data within that environment.

If you were not able to download the data and the link expired, send an email to TracHackAdmin@tracfone.com requesting the data link.

Can I download the data and use my local machine for the competition?

No. To keep things fair, we are providing each team with the same machine and environment in which to design, develop and submit their solutions. The environment will allow you to access the shell and install any Python packages that you need using pip.

Is there any PII data in the dataset?

The dataset is scrubbed, transformed, and selected in a way that protects our customers’ data and identity. The dataset contains:

  • No PII data 
  • No mobile number; replaced with a randomly generated ID without a way for TracFone to go back to the original customer number  
  • No geo-locations
  • No device identifiers
  • Daily aggregates of service utilization
  • No call detail records of who called whom and at what time (even our data scientists don’t have access to that)
  • Further noise added to the records, which shifts the data away from the original source data but does not compromise machine learning use cases

Besides this, we have enforced a sampling strategy to ensure further anonymity. For example, each zip code that we sample from has no fewer than 1,000 records, making it difficult to narrow the data down to a specific person within a community. TracFone data scientists have no way of going back to the original customer identities from the dataset used in this competition.

Can I use the data after the Challenge is done?  

No.  You cannot use any TracFone-supplied data for any purpose other than participation in the Challenge. Any other use is strictly forbidden. In fact, as part of the entry process, each team member will be required to sign a non-disclosure agreement (NDA) expressly agreeing to this restriction. Further, you understand that the data sets supplied by TracFone will be based upon anonymous and deidentified customers, and you expressly agree you will not make any attempt – and will not authorize or assist anyone else to make any attempt – to identify any actual persons from the anonymized data.    

Who can participate? 

Students enrolled at our partner universities: University of Miami, University of Navarra and University of New South Wales. Participants must also be 18 years of age or older. However, a student is not eligible if he/she is currently employed by TracFone, or is a member of the immediate family or household of anyone who is (a) a judge of the Challenge, (b) an officer, director, or employee of TracFone, or (c) directly involved in the creation or administration of the Challenge.

Do I need to have a team? 

The only restriction is that teams may not exceed five (5) members. If you want to go it alone, you can, but working as a team can be very helpful for splitting the workload and trying different strategies to solve the problem. Each participant can either work alone or be part of a single team; he or she cannot be part of multiple teams for each event.

Can multiple teams work together? 

No.  Each team must work independently. At the end of the competition the winning teams (e.g. top 3) will present their solution for everyone to learn from.

How do I submit my solution?  

See the Submissions page for details, but in short you may make daily submissions of your predictions that our automated jobs will evaluate to give you a score the next morning.

In the real world, data science solutions need to make their way into production as automated jobs that run on a schedule or in response to an event. That means data scientists need to be able to produce the ‘recipe’ (i.e. code) and not just the ‘cake’ (i.e. model or predictions). We also want to track and version control the solutions to collaborate across multiple teams.

This is why the top 5 teams will be invited to submit their code at the end of the competition for us to review and verify that their solution is readable and reproducible. If we are unable to verify and replicate a team’s final submission, then they are disqualified.

How are solutions evaluated? 

Every night the latest submission from each of the teams will be evaluated via the F1 Score. We will publish this score for each of the teams on a leaderboard the following morning that will allow the teams to get a sense of how they are doing and take the week’s leader jersey. (Actually, there is no jersey, but definitely bragging rights). 

At the end of the Challenge, the final version that is submitted is evaluated, and the winners are announced at an award ceremony and presentations.  
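
For reference, the F1 score is the harmonic mean of precision and recall. The sketch below shows one way to compute it locally before submitting. It assumes binary labels with 1 as the positive class; the function name and the sample labels are illustrative only, and the official evaluation job may differ in details such as the label encoding or averaging.

```python
def f1_score(y_true, y_pred, positive=1):
    """Return the F1 score (harmonic mean of precision and recall) for binary labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    # Count true positives, false positives and false negatives.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No correct positive predictions: precision and recall are both zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels purely for illustration (2 TP, 1 FP, 1 FN).
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(round(f1_score(y_true, y_pred), 3))  # 0.667
```

Because F1 ignores true negatives, a model that predicts the majority class for every record can score well on accuracy and still get an F1 of zero.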

Who is sponsoring TracHack? 

TracHack is sponsored and organized by TracFone Wireless, Inc. in collaboration with University of Miami, University of Navarra and University of New South Wales.

Help, I cannot connect to the environment.

If you see an error message like “The security token included in the request is invalid.” then it means that your terminal does not have the credentials set in the environment variables. You need to make sure you set (on Windows) or export (on Mac/Linux) the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.

If that still doesn’t work, check that there isn’t an extra space when you set/export the environment variables (e.g. AWS_SECRET_ACCESS_KEY=XXXXXXXX).
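
If Python is available in the same terminal session, a quick check along these lines can confirm whether the two variables are visible and free of stray spaces. This is a minimal sketch: the helper name check_aws_env is hypothetical, and it only inspects the variables; it does not verify that the credentials themselves are valid.

```python
import os

REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def check_aws_env(env=None):
    """Return a list of problems found in the credential environment variables."""
    env = os.environ if env is None else env
    problems = []
    for name in REQUIRED_VARS:
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif value != value.strip():
            # A trailing space after the value is easy to introduce with `set` on Windows.
            problems.append(f"{name} has leading or trailing whitespace")
    return problems


if __name__ == "__main__":
    issues = check_aws_env()
    print("\n".join(issues) if issues else "Both credential variables look set.")
```

Run it from the same terminal window in which you set or exported the variables, since environment variables do not carry over to other windows.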

Still not working? Restart your computer/laptop and start the terminal again, remembering to set the environment variables again.

No luck still? Send an email to TracHackAdmin@tracfone.com with your team name.

Error “An error occurred when calling StartSession operation:”

Looks like your machine has gone down. Please email TracHackAdmin@tracfone.com with your team name and we’ll create a new machine and send you new credentials.

Error “An error was encountered: Invalid status code '404' from http://XXXXXXX with error payload: {"msg":"Session '1' not found."}”

Kindly shut down the Jupyter Notebook and restart it.

Issues using pandas to write to S3 (your Jupyter directory)

Kindly restart the notebook. If the problem persists, try the options for writing to S3 listed under the Final Submission Guidelines.

Help, my machine keeps going down. What is going wrong?

There can be a few reasons why your machine loses connectivity, but typically they are related to resource utilization (RAM and CPU) on the machine. Remember these are 4-CPU machines with 16 GB RAM and 100 GB of Elastic Block Storage. If all the compute cores and RAM are in use, no capacity is left for network I/O, and the machine becomes unresponsive.

Here are a few patterns we have noticed that cause these issues:

  1. Running multiple active kernels (or notebooks) at the same time. Try to keep only one or two kernels active at any point in time. Ensure that team members are not all trying to use the kernel at the same time. In addition, make sure you shut down your kernel when you close your notebook. You can also list “Running” kernels in JupyterHub to see how many kernels are running at any point in time.
  2. Doing multiple complex operations (e.g., Grid search, data prep, model training, etc.) in a single cell of the notebook. Try to break up the complex code into multiple cells.
  3. Doing heavy in-memory computation that causes memory pressures. Save intermediate datasets to disk/S3 periodically.
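
As an illustration of the third point, the sketch below saves an intermediate result to disk, frees the memory, and reloads the result when needed. It uses only the standard library; the helper names, the sample data and the file path are all hypothetical. With pandas, the DataFrame's own to_pickle or to_parquet methods would serve the same purpose.

```python
import gc
import os
import pickle
import tempfile


def save_checkpoint(obj, path):
    """Write obj to path via a temporary file so a crash cannot leave a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as handle:
        pickle.dump(obj, handle)
    os.replace(tmp_path, path)  # atomic rename within the same directory


def load_checkpoint(path):
    """Read back an object previously written by save_checkpoint."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Stand-in for an expensive data-prep result.
features = [{"id": i, "value": i * i} for i in range(1000)]
checkpoint_path = os.path.join(tempfile.gettempdir(), "features_checkpoint.pkl")

save_checkpoint(features, checkpoint_path)
del features   # drop the in-memory copy
gc.collect()   # ask Python to release the memory now

features = load_checkpoint(checkpoint_path)  # reload only when it is needed again
```

Checkpointing after each expensive step also means that a kernel restart costs you only the step in progress rather than the whole pipeline.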

Since we want to ensure teams are competing on data science skills rather than access to compute resources, we are requiring every team to work within the same compute constraints.

How do I reboot my machine if I encounter the following error: “An error occurred (TargetNotConnected) when calling the StartSession operation: is not connected”?

Contact us and we will reboot your server or cluster.

Final Submission Guidelines

1. Submit your predictions as a submission CSV file: Teams will produce a submission with filename: yyyy-mm-dd-final.csv to be saved in the submission folder.

2. There is a folder called code in your Jupyter home directory.

3. Consolidate all of your code (data prep, feature selection, model training and prediction, etc.) into a single Jupyter notebook and call it mlcode.ipynb. Save this inside the code folder. It is VERY important that your code reproduces the submission. We will use that notebook to reproduce your submission predictions. If these don’t match, then your submission will not be considered valid.
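
The file name in step 1 can be built programmatically so the date format is never mistyped. This is a sketch under two assumptions that the guidelines do not spell out: that yyyy-mm-dd is the date on which you make the final submission, and that the submission folder is reachable as a relative path named submission. The helper name is hypothetical.

```python
import os
from datetime import date


def final_submission_path(folder="submission", day=None):
    """Build the path <folder>/yyyy-mm-dd-final.csv for the given (or current) date."""
    day = date.today() if day is None else day
    return os.path.join(folder, f"{day:%Y-%m-%d}-final.csv")


# Example with a fixed, made-up date.
print(final_submission_path(day=date(2021, 3, 5)))
```

On the Linux environment this prints submission/2021-03-05-final.csv; the resulting path can be passed directly to whatever call writes your predictions to CSV.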

If you are facing issues writing to S3, you can try the following options:

Option 1: Execute the following command in a new cell to install a specific version of pandas. Because the pip repository is sometimes unavailable, this may not always work.
!pip install pandas==1.0.3

Option 2: Use Boto3 to write the local file to S3.

Step 1: !pip install boto3
Step 2: dataframe.to_csv("Test.csv")
Step 3: import boto3
        s3 = boto3.resource('s3')
        # Kindly fill in the team name
        teamname = ''
        # The S3 key must include the file name, not only the folder prefix
        s3.meta.client.upload_file('Test.csv', 'tf-trachack-notebooks', teamname + '/jupyter/jovyan/Test.csv')

Can we upload new third-party public data that we think complements the provided data?

Yes, you can use third-party data as long as it is not restricted for commercial use and is less than 1 GB.

I have more questions. Who can I ask? 

For any other questions, feel free to email the Data Science team at TracHackAdmin@tracfone.com