Important points to consider while developing a data science project

  • neovijayk
  • Jun 13, 2020
  • 4 min read

In this article we will look at important points to keep in mind when starting a Machine Learning or Deep Learning project.

POINT 1: Don’t rush to apply Machine Learning everywhere

  1. First understand the problem correctly: what do we want to solve, and how can we solve it logically?

  2. Familiarize yourself with the domain.

  3. Then understand the data we have: which features/fields are captured.

  4. Study the available data in detail, find any gaps in it, and gain some insights by doing a brief analysis.

  5. Being familiar with the problem, the required domain knowledge, the data we have, and the logical steps will help us identify the tools, algorithms, and any extra data we should capture alongside the existing data to solve the problem.

  6. Often you will find that simple statistics and some in-depth analytics can solve many of the problems in the project without Machine Learning or Deep Learning.

  7. Alternatively, we can use simple analytics to derive features that serve as input to our machine learning algorithm and may improve its performance.
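
As a rough illustration of items 6 and 7, here is a minimal sketch that flags an unusual value with plain statistics and no Machine Learning. The readings, the variable names, and the cutoff of 5 times the median absolute deviation are all assumptions made up for this example, not a recommended rule.

```python
import statistics

# Hypothetical sensor readings; the values are invented for illustration.
readings = [10.1, 9.8, 10.3, 10.0, 25.7, 9.9, 10.2]

# The median is a robust estimate of the "typical" reading.
median = statistics.median(readings)

# Distance of each reading from the median. This list could also be
# used later as a derived feature for a machine learning model.
deviations = [abs(x - median) for x in readings]

# Median absolute deviation (MAD): a robust measure of spread.
mad = statistics.median(deviations)

# Assumed cutoff: anything further than 5 * MAD from the median is unusual.
cutoff = 5 * mad
outliers = [x for x, d in zip(readings, deviations) if d > cutoff]

print(outliers)
```

With these made-up readings only the value 25.7 is flagged. Whether a rule this simple is good enough depends on the problem and the domain, which is why the first five items come before any modelling.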

POINT 2: Before jumping into the code, get the logical steps clear in your mind

  1. This comes with coding experience. Always write down the logical steps you want to perform on the data first.

  2. Decide how the data should be prepared for the analysis: pre-processing, deriving new features if required, adding or merging different data tables, etc.

  3. Then find appropriate versions of the tools, packages, and functions to implement these steps in Python (this will require some R&D).
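
One way to apply this is to write the logical steps as comments first and only then fill in the code under each one. The sketch below assumes Pandas is installed; the two tables, their column names, and the derived feature are hypothetical and only serve to show the shape of the approach.

```python
import pandas as pd

# Step 1: get the data (hypothetical tables, built inline for the example).
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 3],
    "amount": [20.0, 30.0, None, 35.0, 15.0],
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "region": ["north", "south", "north"],
})

# Step 2: pre-process. Here we simply drop orders with a missing amount.
orders = orders.dropna(subset=["amount"])

# Step 3: merge the tables so every order carries its customer's region.
merged = orders.merge(customers, on="customer_id", how="left")

# Step 4: derive a new feature: each order's share of its customer's total.
merged["customer_total"] = merged.groupby("customer_id")["amount"].transform("sum")
merged["share_of_total"] = merged["amount"] / merged["customer_total"]

print(merged)
```

Writing the step comments first makes it easier to spot a missing step (for example, how missing values should be handled) before any time is spent on implementation details.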

POINT 3: Pandas vs NumPy for computations

  1. Python executes more slowly than C and C++. And if the code is not written carefully, computation tasks will take a lot of time.

  2. For faster execution, use built-in functions and avoid loops as much as possible.

  3. If the code has many mathematical computations on a large dataset (let's say more than 1 GB), try NumPy instead of Pandas, since plain NumPy arrays usually carry less overhead for purely numeric work.

  4. Or use both in combination, i.e. do the computations in NumPy and convert back to Pandas if your code really needs Pandas.
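
The combination described in item 4 might look like the sketch below. It assumes both NumPy and Pandas are installed; the DataFrame and its column names are invented, and the row-wise norm is just a stand-in for whatever numeric computation the project needs.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data held in a DataFrame.
df = pd.DataFrame({"x": [3.0, 6.0, 0.0], "y": [4.0, 8.0, 5.0]})

# Pull the numeric columns out as a plain NumPy array.
values = df[["x", "y"]].to_numpy()

# Do the heavy computation in NumPy: Euclidean norm of every row.
norms = np.sqrt((values ** 2).sum(axis=1))

# Convert back: attach the result to the DataFrame as a new column.
df["norm"] = norms

print(df)
```

On a table this small there is nothing to gain; the pattern only pays off on large data, so it is worth measuring before and after.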

POINT 4: Avoid explicit loops; use built-in functions

  1. Avoid explicit for loops wherever you can; they are very slow in Python.

  2. Use vectorization, i.e. apply an operation to a whole array at once instead of element by element.

  3. Also use built-in functions such as groupby in Pandas and the dot product in NumPy.
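
To make the difference concrete, the sketch below computes the same total once with an explicit loop and once with NumPy's dot product, and then aggregates with Pandas groupby. It assumes NumPy and Pandas are installed; the prices, quantities, and store names are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data.
prices = np.array([2.0, 3.5, 1.25])
quantities = np.array([4, 2, 8])

# Explicit loop: works, but runs element by element in the Python interpreter.
total_loop = 0.0
for p, q in zip(prices, quantities):
    total_loop += p * q

# Vectorized: a single NumPy call computes the same sum of products.
total_dot = float(np.dot(prices, quantities))

# Built-in aggregation: groupby instead of looping over the rows by hand.
sales = pd.DataFrame({
    "store": ["a", "b", "a", "b"],
    "revenue": [10, 20, 30, 40],
})
revenue_by_store = sales.groupby("store")["revenue"].sum()

print(total_loop, total_dot)
print(revenue_by_store)
```

Both totals are 25.0 here. The loop and the vectorized call give the same answer, so the choice between them is about speed and readability, not correctness.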

POINT 5: Always check memory usage and execution time

  1. Always keep track of how much memory Python uses and how long the code takes to execute.

  2. Try to minimise both as much as possible.
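
A minimal way to track both, using only the standard library, is sketched below. The function `build_squares` is a hypothetical workload invented for this example; `time.perf_counter` and `tracemalloc` are standard-library tools, and this is only one of several possible ways to measure.

```python
import time
import tracemalloc


def build_squares(n):
    """Hypothetical workload: build a list of the first n squares."""
    return [i * i for i in range(n)]


# Start tracing Python memory allocations and start the clock.
tracemalloc.start()
start = time.perf_counter()

squares = build_squares(100_000)

# Stop the clock and read the current and peak traced memory (in bytes).
elapsed = time.perf_counter() - start
current_bytes, peak_bytes = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"elapsed: {elapsed:.4f} s")
print(f"peak traced memory: {peak_bytes / 1e6:.2f} MB")
```

The exact numbers depend on the machine and the Python version. For a Pandas DataFrame, `df.memory_usage(deep=True)` reports the memory used per column.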

POINT 6: Always create versions of code files used in development after certain progress is achieved

  1. It is good practice to create versions of the code notebook or files as the project progresses.

  2. As development progresses, your code file will grow and you will make many changes to it. Creating a version after every review of the code and its outputs can save a lot of time if you need to jump back to a previous version without recreating it.

POINT 7: Always use comments in your code

  1. Comments are important to describe the purpose of the notebook and the objective of the lines of code, functions, or classes used in it.

  2. This makes your code more readable and easier to debug.

  3. Always mention a function's inputs and outputs in the comments as a checklist, so that when you read it a few days later you know what the function does and what it should return.

  4. Mention the assumptions you made about the data.

  5. Also mention the updates you made and why.
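
The checklist above could be written as a docstring like the one in this sketch. The function `fill_missing_with_mean`, its behaviour, and the update note are hypothetical and exist only to show the layout.

```python
def fill_missing_with_mean(values):
    """Replace missing entries with the mean of the observed values.

    Input:
        values: list of float or None, where None marks a missing entry.
    Output:
        A new list of the same length, with every None replaced by the
        mean of the non-missing entries. The input list is not modified.
    Assumptions about the data:
        - None is the only marker used for missing values.
        - At least one entry is not missing.
    Updates:
        - (Hypothetical note) Raise an error on all-missing input instead
          of returning an empty result, so the problem is noticed early.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("values must contain at least one non-missing entry")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


print(fill_missing_with_mean([1.0, None, 3.0]))
```

Here the missing entry is replaced by 2.0, the mean of 1.0 and 3.0. Keeping the assumptions next to the code means a reviewer can check them against the real data.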

POINT 8: Code review

  1. If you have a mentor or senior, having them review your work after some development is very important. Otherwise, do it yourself.

  2. You should be sure about every line of code in the project: its purpose, what it does, what inputs it requires, and what output we want from it. Also check whether the code as a whole produces output that makes logical sense and matches what is expected.

  3. This is one of the most important steps to perform after a certain level of development, to make sure everything is working as intended.

  4. Ask lots of relevant questions about lines of code or a specific function and its role in the code. Questioning yourself without making assumptions is important for learning new tools and technologies on your own.

  5. Code reviews may reveal bugs in the code, the data, or the logic.

These are some of the important points that will help you in your data science project if you understand and follow them.

That’s it for now. If you like this article, please like and subscribe to this blog. Thank you 🙂


© 2023 by my-learnings.   Copy rights Vijay@my-learnings.com
